Patent abstract:
Machine learning for identifying candidate object types for video insertion. The disclosure provides methods, systems, and computer programs for identifying candidate object types for video insertion through the use of machine learning. Machine learning is used for at least part of the processing of the image contents of a plurality of frames of a source video scene. The processing includes identifying a candidate insertion zone for inserting an object into the image content of at least some of the plurality of frames and determining an insertion zone descriptor for the identified candidate insertion zone, the insertion zone descriptor comprising a candidate object type indicative of an object type that is suitable for insertion into the candidate insertion zone.
Publication number: BR102018067373A2
Application number: R102018067373-4
Filing date: 2018-08-31
Publication date: 2019-03-19
Inventors: Tim Harris; Philip McLauchlan; David Ok
Applicant: Mirriad Advertising Plc
IPC main classification:
Patent description:

Invention Patent Descriptive Report for: MACHINE LEARNING FOR IDENTIFYING CANDIDATE OBJECT TYPES FOR VIDEO INSERTION
Technical Field [001] The present disclosure relates to a system, method, software and apparatus for processing the image contents of a plurality of frames of a scene of a source video, and to the training of such a system and apparatus.
Background [002] With the advent of digital file processing, it is possible to insert objects digitally (also referred to in this document as embedding) into a video. Inserting objects digitally into a video can have many benefits, for example, enhancing the visual effects of the video, improving the realism of a video, or allowing more flexibility for the video after it has been recorded, which means that fewer decisions need to be taken with respect to the objects that should be included in a scene at the scene's filming stage. Consequently, the insertion of digital objects is becoming more common and being used by video producers for all kinds of purposes.
[003] Currently, the insertion of a digital object typically requires several stages of processing. As further described below, these can be broadly separated into:
[004] 1. the detection of cuts;
[005] 2. the merging and grouping of similar scene shots;
[006] 3. the detection of insertion opportunities (interchangeably called insertion zones);
[007] 4. the contextual characterization of insertion zones; and [008] 5. the correspondence between insertion zones and objects for insertion.
Cut Detection [009] A program can typically be a half-hour or an hour-long show, and the program material is decomposed into scene shots. A scene shot is a consecutive sequence of frames that does not include any editing points, that is, it generally maintains a consistency indicating that it was recorded by a single camera.
[010] Scene shots are delimited by cuts, at which the camera usually stops recording, or the material is edited to give that impression. Broadly speaking, there are two types of cuts: dry cuts and smooth cuts. A dry cut is detected when the visual similarity between consecutive frames is abruptly interrupted, indicating an editing point or a change in camera angle, for example. A smooth cut corresponds to the beginning or the end of a smooth transition, for example a wipe or fade transition, characterized by a gradual but significant change in the visual appearance of the video across several frames.
[011] First, it may be necessary to analyze the source video material (for example, the program material) and find suitable scenes for object insertion. This is generally called a pre-analysis pass, and is best done by dividing the source video into scenes and, particularly, scenes taken from the same camera position. Segmentation of video material into scenes can typically be performed automatically, using scene shot change detection. A video analysis module can automatically detect dry and smooth cuts between different shots, which correspond to dry and smooth transitions respectively.
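To make the cut-detection step above concrete, the following is a minimal sketch (an assumption, not the disclosure's own algorithm) that flags a dry cut wherever the colour histogram of consecutive frames changes abruptly; the histogram settings and threshold are illustrative.

```python
# Hedged sketch of dry-cut detection: frames are assumed to be RGB numpy arrays;
# the bin count and threshold below are illustrative, not taken from the disclosure.
import numpy as np

def colour_histogram(frame, bins=32):
    """Per-channel colour histogram, normalised to sum to 1."""
    hists = [np.histogram(frame[..., c], bins=bins, range=(0, 255))[0] for c in range(3)]
    hist = np.concatenate(hists).astype(float)
    return hist / hist.sum()

def detect_dry_cuts(frames, threshold=0.4):
    """Return indices i where a dry cut is detected between frame i-1 and frame i."""
    cuts = []
    prev = colour_histogram(frames[0])
    for i, frame in enumerate(frames[1:], start=1):
        cur = colour_histogram(frame)
        # A dry cut shows up as an abrupt change in visual similarity
        if 0.5 * np.abs(cur - prev).sum() > threshold:
            cuts.append(i)
        prev = cur
    return cuts
```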
Merging and Grouping of Similar Shots [012] Once a scene shot or scene shots have been detected, continuity detection can also be applied in an additional processing step to identify similar scene shots that were detected in the source video. In this way, when an insertion opportunity is identified in one scene shot, a scene similarity algorithm can identify additional scene shots in which that opportunity is likely to be present.
Insertion Zone Detection [013] The image regions in the source video content that are suitable for the insertion of additional material are called insertion zones, and these can be broadly categorized into surfaces and objects. In general, a surface may be suitable for inserting material. In the case of a wall, for example, a poster can be added. In the case of a table, an object such as a drink can be inserted. When an object is identified as an insertion zone, the opportunity for material insertion can be related to the rebranding of any trademark insignia identified in the product, replacement of the object with another object that belongs to the same class of objects, or the addition of an additional similar object next to the object.
[014] Detection of insertion zones can be sought and refined by tracking pixels that move coherently across the source video material. Image-based tracking techniques include, but are not limited to, plane tracking algorithms to calculate and model the 2D transformations of each image in the source video.
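As an illustration of the plane-tracking idea mentioned in the preceding paragraph, the sketch below estimates the 2D transformation (a homography) of a tracked plane between two frames using OpenCV feature matching and RANSAC; the disclosure does not prescribe this particular pipeline, so it is an assumed example.

```python
# Assumed plane-tracking sketch using OpenCV; the disclosure only says that
# plane-tracking algorithms model the 2D transformations of each image.
import cv2
import numpy as np

def plane_transform(prev_frame, cur_frame):
    """Estimate the 3x3 homography mapping prev_frame onto cur_frame."""
    prev_gray = cv2.cvtColor(prev_frame, cv2.COLOR_BGR2GRAY)
    cur_gray = cv2.cvtColor(cur_frame, cv2.COLOR_BGR2GRAY)
    orb = cv2.ORB_create(1000)
    kp1, des1 = orb.detectAndCompute(prev_gray, None)
    kp2, des2 = orb.detectAndCompute(cur_gray, None)
    matches = cv2.BFMatcher(cv2.NORM_HAMMING, crossCheck=True).match(des1, des2)
    src = np.float32([kp1[m.queryIdx].pt for m in matches]).reshape(-1, 1, 2)
    dst = np.float32([kp2[m.trainIdx].pt for m in matches]).reshape(-1, 1, 2)
    # RANSAC rejects points that do not move coherently with the dominant plane
    H, _ = cv2.findHomography(src, dst, cv2.RANSAC, 5.0)
    return H
```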
Contextual Characterization of Insertion Zones [015] An operator may be asked to access the identified insertion zone and provide context for possible additional material that could be inserted into it. With the rapid increase in the amount of digital video content that is broadcast or streamed over the internet, the fact that a human operator is not able to process insertion opportunities to identify context much faster than in real time can be a problem.
Correspondence Between Insertion Zones and Product Categories [016] It is not enough to just identify insertion opportunities through pattern recognition processes; some applied intelligence may also be required when selecting the material to be inserted in the video content.
[017] So that an object insertion instance does not diminish the viewing experience, it must make sense within the context of the source video content in which it is placed. If a scene takes place in a kitchen, for example, the additional content to be placed in that scene must be relevant to the objects that the viewer would expect to see in that location. For example, one might not expect to see a perfume bottle on a kitchen sideboard next to a kettle; a coffee pot would be much more suitable in that context. Likewise, a bathroom scene is suitable for placing bathroom-related items or hygiene items, rather than groceries. Consequently, it may be necessary for an operator to evaluate the scene to select a particular object or category of objects that would be suitable for insertion in any identified insertion zone. Again, the fact that a human operator is unable to process insertion opportunities to identify context much more quickly than in real time can be a problem.
[018] It can be seen from the above disclosure that identifying insertion zone opportunities and suitable objects for insertion can typically be a time-consuming, multi-stage process that can limit the amount of video material that can be analyzed.
Short description [019] In a first aspect of the present disclosure, a system is provided that comprises: a candidate insertion zone module configured to: receive a plurality of frames of a scene from a source video; and process, at least in part using machine learning, image content from the plurality of frames to: identify a candidate insertion zone for the insertion of an object in the image content of at least some of the plurality of frames; and determine an insertion zone descriptor for the identified candidate insertion zone, the insertion zone descriptor comprising a candidate object type indicative of an object type that is suitable for insertion in the candidate insertion zone.
[020] The candidate insertion zone module may comprise: an identification sub-module configured to perform the identification of the candidate insertion zone and the determination of the insertion zone descriptor for the identified candidate insertion zone, and to: determine, for at least some of the pixels of the plurality of frames in the scene, an insertion probability vector comprising a probability value for each of a plurality of insertion marks, where each probability value is indicative of the chance that the type of insertion indicated by the corresponding insertion mark is applicable to the pixel.
[021] The plurality of insertion markings may comprise a mark indicating that the pixel is not suitable for the insertion of an object; and one or more markings indicative of one or more corresponding types of object.
[022] The candidate insertion zone can comprise a plurality of pixels that have insertion probability vectors which all have a maximum argument of probability values corresponding to a mark that is indicative of the candidate object type.
[023] The candidate insertion zone module may comprise: a scene descriptor sub-module configured to process, by using machine learning, image contents of at least some of the plurality of frames to determine a scene descriptor, wherein the determination of the candidate object type is based at least in part on the scene descriptor.
[024] The identification of the candidate insertion zone can be based, at least in part, on the scene descriptor.
[025] The scene descriptor can comprise at least one global context descriptor, where each global context descriptor is indicative of any of the following: Scene location; State of mind; Demography; Human action; Time of day; Season of the year; Climate conditions; and / or Filming location.
[026] The scene descriptor sub-module can be additionally configured to: receive audio content related to the source video scene; and determine the scene descriptor based, at least in part, on the received audio content.
[027] The scene descriptor can comprise at least one regional context descriptor indicative of an entity identified in the scene. The at least one regional context descriptor can be indicative of an entity identified in the scene, which is any one of: a human; an animal; a surface; or an object.
[028] The scene descriptor sub-module can be configured to process, using machine learning, image content from the plurality of frames to determine, for at least some of the pixels of the plurality of frames in the scene, a regional context probability vector comprising a probability value for each of a plurality of regional context markings, where each probability value is indicative of the chance that the type of entity indicated by the corresponding regional context mark is applicable to the pixel.
[029] The plurality of regional context markings can comprise: a mark indicating that the pixel is not related to anything; and at least one of: one or more markings indicative of a human; one or more markings indicative of an animal; one or more markings indicative of an object; and / or one or more markings indicative of a surface.
[030] The candidate insertion zone module can additionally comprise: a database that comprises a contextually indexed library of insertion object types; where the determination of the candidate object type is based, at least in part, on the insertion object type library and on the scene descriptor.
[031] Alternatively, the candidate insertion zone module can additionally comprise: an insertion zone and insertion object identification sub-module configured to identify the candidate insertion zone and the candidate object types by processing, using machine learning, image content of the plurality of frames to determine, for at least some of the pixels of the plurality of frames in the scene, an insertion probability vector comprising a probability value for each of a plurality of insertion marks, where each probability value is indicative of the chance that the type of insertion indicated by the corresponding insertion mark is applicable to the pixel. The plurality of insertion marks may comprise: a mark indicating that the pixel is not suitable for the insertion of an object; and one or more markings indicating that one or more corresponding types of object are suitable for insertion into the pixel. The candidate insertion zone can comprise a plurality of pixels that have insertion probability vectors which all have a maximum argument of probability values corresponding to a mark that is indicative of the candidate object type.
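As a minimal illustration of the per-pixel marking just described, the sketch below assumes the sub-module outputs an (H, W, K) array of insertion probability vectors, one value per insertion mark and pixel (an assumed data layout, not specified in the disclosure), and extracts the pixels whose maximum-argument mark matches a given candidate object type.

```python
# Assumed data layout: insertion_probs has shape (H, W, K), one probability per
# insertion mark and pixel; candidate_mark is the index of a candidate object type.
import numpy as np

def candidate_zone_mask(insertion_probs, candidate_mark):
    """Boolean (H, W) mask: pixels whose argmax insertion mark is candidate_mark."""
    return insertion_probs.argmax(axis=-1) == candidate_mark
```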
[032] In any of the system implementations identified above, the candidate insertion zone module may additionally comprise a post-processing sub-module configured to determine a duration of the candidate insertion zone through the plurality of frames and / or a size of the candidate insertion zone.
[033] The insertion zone descriptor may additionally comprise at least one of the duration of the candidate insertion zone through the plurality of frames, and / or the size of the candidate insertion zone.
[034] A post-processing sub-module can be additionally configured to determine a Video Impact Score based, at least in part, on the duration of the candidate insertion zone through the plurality of frames and / or a size of the candidate insertion zone.
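The disclosure does not give a formula for the Video Impact Score; purely as an illustration of combining the two factors it names (duration through the plurality of frames and size of the candidate insertion zone), a sketch might look like the following, where the simple product is an assumption.

```python
# Illustrative only: an assumed way to combine duration and relative size into a
# single score; the disclosure does not define the Video Impact Score formula.
def video_impact_score(duration_seconds, zone_area_px, frame_area_px):
    relative_size = zone_area_px / frame_area_px   # fraction of the frame covered by the zone
    return duration_seconds * relative_size        # longer-lived, larger zones score higher
```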
[035] In any of the implementations of the system identified above, the system may additionally comprise: a segmentation module configured to: generate an insertion zone suggestion frame comprising a frame of the plurality of frames overlaid with a visualization of the candidate insertion zone.
[036] In any of the implementations of the system identified above, the system can additionally comprise: an object insertion module configured to: select an object for insertion based on the candidate object type; and generate an object insertion suggestion frame comprising a frame of the plurality of frames and the selected object inserted in the candidate insertion zone.
[037] In any of the implementations of the system identified above, the candidate insertion zone module can be additionally configured to: receive feedback from an operator, where the feedback is indicative of the suitability of the identified candidate insertion zone and / or the candidate object type for the image contents of the plurality of frames; and modify the machine learning based, at least in part, on the feedback.
[038] In a second aspect of the present disclosure, a method is provided to process the image contents of a plurality of frames from a scene of a source video, the method comprising: receiving the plurality of frames from the source video scene; and processing, at least in part using machine learning, image content from the plurality of frames to: identify a candidate insertion zone for the insertion of an object in the image content of at least some of the plurality of frames; and determine an insertion zone descriptor for the identified candidate insertion zone, the insertion zone descriptor comprising a candidate object type indicative of an object type that is suitable for insertion into the candidate insertion zone.
[039] In a third aspect of the present disclosure, a computer program is provided to execute the method of the second aspect when executed in the processor of an electronic device.
[040] In a fourth aspect of the present disclosure, an electronic device is provided comprising: a memory for storing the computer program of the third aspect; and a processor for executing the third aspect computer program.
[041] In a fifth aspect of the present disclosure, a method is provided to train a candidate insertion zone module to identify candidate insertion zones and one or more candidate objects for insertion into a source video scene, the method comprising: receiving a training corpus that comprises a plurality of images, each annotated with identification of at least one insertion zone and one or more types of candidate objects for each insertion zone; and training the candidate insertion zone module using machine learning and the training corpus to process image content from a plurality of frames from the source video to: identify a candidate insertion zone for the insertion of an object in the image content of at least some among the plurality of frames; and determine an insertion zone descriptor for the identified candidate insertion zone, the insertion zone descriptor comprising one or more types of candidate objects indicative of one or more types of object that are suitable for insertion into the candidate insertion zone.
[042] At least some of the plurality of images in the training corpus can additionally be annotated with a scene descriptor, and the candidate insertion zone module can be additionally trained by using machine learning to: identify at least one scene descriptor for the image content of at least some of the plurality of frames; and determine the one or more types of candidate objects based, at least in part, on at least one identified scene descriptor.
[043] The method of the fifth aspect may additionally comprise determining one or more scene descriptors for at least some of the plurality of images in the training corpus by using a trained machine learning module configured to identify a scene descriptor by processing the content of an image; where training the candidate insertion zone module using machine learning further comprises training the candidate insertion zone module to: identify at least one scene descriptor for the image content of at least some of the plurality of frames; and determine the one or more types of candidate objects based, at least in part, on at least one identified scene descriptor.
Aspects of disclosure
[044] The non-limiting aspects of disclosure are
defined in the following numbered clauses.
[045] 1. A system that comprises:
[046] a candidate insertion zone module configured for:
[047] receive a plurality of frames of a scene from a source video; and
[048] process, at least in part using machine learning, image content from the
[049] plurality of frames for:
[050] identify a candidate insertion zone for the insertion of an object in the image content of at least some of the plurality of frames; and
[051] determine an insertion zone descriptor for the identified candidate insertion zone, the insertion zone descriptor comprising one or more types of candidate objects indicative of one or more types of object that are recommended for insertion into the candidate insertion zone.
[052] 2. The system according to clause 1, in which the candidate insertion zone module additionally comprises:
[053] an insertion zone and insertion object identification sub-module configured to identify the candidate insertion zone and the types of candidate objects through the processing, through the use of machine learning, of image contents of the plurality of frames to determine, for each of at least some of the pixels of the plurality of frames in the scene, an insertion probability vector that comprises a probability value for each of a plurality of insertion marks, where each probability value is indicative of the probability that the corresponding insertion mark is applicable to the pixel.
[054] 3. The system according to clause 2, in which the plurality of insertion markings comprises:
[055] a mark indicating that the pixel is not suitable for the insertion of an object; and [056] one or more markings indicating that one or more corresponding types of object are suitable for insertion into the pixel.
[057] 4. The system according to clause 2 or clause 3, in which the candidate insertion zone comprises a plurality of pixels that have insertion probability vectors that all have a maximum argument of probability values that correspond to a mark that is indicative of one or more types of candidate objects.
[058] 5. The system according to clause 1, in which the candidate insertion zone module additionally comprises:
[059] a scene descriptor sub-module configured to process, using machine learning, image contents of at least some of the plurality of frames to determine a scene descriptor;
[060] a database comprising a contextually indexed library of insertion object types; and [061] an identification submodule configured for:
[062] receiving the scene descriptor from the scene descriptor sub-module;
[063] identify, by use of machine learning, the candidate insertion zone by use of the scene descriptor; and
[064] determine, by use of machine learning, the candidate insertion object by use of at least the insertion object type library and the scene descriptor.
[065] 6. The system according to clause 5, in which the machine learning sub-module is additionally configured to:
[066] receive audio content related to the source video scene; and
[067] determine the scene descriptor based, at least in part, on the received audio content.
[068] 7. The system according to clause 5 or clause 6, in which the scene descriptor comprises at least one global context descriptor, in which each global context descriptor is indicative of any of:
[069] Scene location;
[070] State of mind;
[071] Demography;
[072] Human Action;
[073] Time of day;
[074] Season of the year.
[075] 8. The system according to any of clauses 5 to 7, in which the scene descriptor comprises at least one regional context descriptor indicative of an entity identified in the scene.
[076] 9. The system according to any one of clauses 5 to 8, in which the identification sub-module is configured to determine, based on the scene descriptor and the insertion object type library, for each one of at least some of the pixels of the plurality of frames in the scene, an insertion probability vector that comprises a probability value for each of a plurality of insertion markings, where each probability value is indicative of the probability that the corresponding insertion mark is applicable to the pixel.
[077] 10. The system according to clause 9, in which the plurality of insertion markings comprises:
[078] a mark indicating that the pixel is not suitable for the insertion of an object; and [079] one or more markings indicating that one or more corresponding types of object are suitable for insertion into the pixel.
[080] 11. The system according to any previous clause, in which the candidate insertion zone module additionally comprises a post-processing sub-module configured to determine a duration of the candidate insertion zone through the plurality of frames and / or a size of the candidate insertion zone.
[081] 12. The system according to clause 11, in which the insertion zone descriptor additionally comprises at least one of the duration of the candidate insertion zone through the plurality of frames and / or the size of the candidate insertion zone.
[082] 13. The system according to clause 11 or clause 12, in which the post-processing sub-module is additionally configured to determine a Video Impact Score based, at least in part, on the duration of the candidate insertion zone through the plurality of frames and / or a size of the candidate insertion zone.
[083] 14. The system according to any previous clause, which further comprises:
[084] a segmentation module configured to:
[085] generate an insertion zone suggestion frame that comprises a frame of the plurality of frames overlaid with a visualization of the candidate insertion zone and of at least one of the one or more types of candidate objects.
[086] 15. The system according to any previous clause, which further comprises:
[087] an object insertion module configured for:
[088] select an object for insertion based on one or more types of candidate objects; and [089] generate an object insertion suggestion frame comprising a frame of the plurality of frames and the selected object inserted in the candidate insertion zone.
[090] 16. A method for processing the image contents of a plurality of frames of a scene from a source video, the method comprising:
[091] receive the plurality of frames from the scene of the source video; and [092] process, at least in part using machine learning, image content of the plurality of frames for:
[093] identify a candidate insertion zone for the insertion of an object in the image content of at least some among the plurality of frames; and [094] determining an insertion zone descriptor for the identified candidate insertion zone, the insertion zone descriptor comprising one or more types of candidate objects indicative of one or more types of object that are recommended for insertion into the candidate insertion zone.
[095] 17. A computer program to execute the method of clause 16 when executed in the processor of an electronic device.
[096] 18. An electronic device comprising:
[097] a memory for storing the computer program of clause 17; and [098] a processor for executing the computer program of clause 17.
[099] 19. A method for training a candidate insertion zone module to identify candidate insertion zones and one or more candidate objects for insertion into a scene from a source video, the method comprising:
[100] receive a training corpus that comprises a plurality of images, each annotated with identification of at least one insertion zone and one or more types of candidate objects for each insertion zone;
[101] train the candidate insertion zone module, using machine learning and the training corpus, to process image content from a plurality of frames from the source video to:
[102] identify a candidate insertion zone for the insertion of an object in the image content of at least some among the plurality of frames; and [103] determining an insertion zone descriptor for the identified candidate insertion zone, the insertion zone descriptor comprising one or more types of candidate objects indicative of one or more types of object that are recommended for insertion into the candidate insertion zone.
[104] 20. The method according to clause 19, in which at least some of the plurality of images in the training corpus are each annotated additionally with a scene descriptor, and in which the candidate insertion zone module is additionally trained by using machine learning to:
[105] identify at least one scene descriptor for the image content of at least some of the plurality of frames; and
[106] determining the one or more types of candidate objects based, at least in part, on at least one identified scene descriptor.
[107] 21. The method according to clause 19 which further comprises:
[108] Determine one or more scene descriptors for at least some of the plurality of images in the training corpus using a trained machine learning module configured to identify a scene descriptor by processing the content of an image; where [109] training the candidate insertion zone module by using machine learning comprises, in addition, training the candidate insertion zone module to:
[110] identify at least one scene descriptor for the image content of at least some of the plurality of frames; and [111] determining the one or more types of candidate objects based, at least in part, on at least one identified scene descriptor.
Figures [112] Additional features and advantages of the present disclosure will become evident from the following description of an embodiment thereof, presented only as an example, and with reference to the drawings, in which similar numerical references refer to similar parts, and in which:
[113] Figure 1 shows an exemplary schematic representation of a system in accordance with an aspect of the present disclosure;
[114] Figure 2 shows an exemplary process performed by the system in Figure 1;
[115] Figure 3 shows an example list of object types for insertion in a candidate insertion zone;
[116] Figure 4 shows an example insertion zone suggestion frame;
[117] Figure 5 shows an example object insertion suggestion frame;
[118] Figure 6 shows a first example schematic representation of a configuration of the candidate insertion zone module of the system of Figure 1;
[119] Figure 7a shows exemplary attributes that can be used to describe a detected Human more precisely;
[120] Figure 7b shows exemplary attributes that can be used to describe an object detected more precisely;
[121] Figure 7c shows exemplary attributes that can be used to describe a detected Surface more precisely;
[122] Figure 7d shows exemplary attributes that can be used to describe the location of a scene;
[123] Figure 8 shows exemplary steps in a process for training machine learning for the scene descriptor sub-module of the system in Figure 1;
[124] Figure 9 shows a second example schematic representation of a candidate insertion zone module configuration of the system in Figure 1;
[125] Figure 10 shows an example representation of a training system to train the candidate insertion zone module of the system of Figure 1;
[126] Figure 11 shows the intermediate results of a CNN at different stages after being fed an image.
Detailed description
[127] The present disclosure refers to a technique for using machine learning to identify insertion zones in a video scene and corresponding object types for insertion in the insertion zone. Candidate object types are object types that are suitable for insertion, and can be, for example, object classes such as soda bottle, alcohol bottle, vehicle, cell phone, etc., or they can be more specific, such as particular trademarks for particular objects.
Large-scale Generation of Insertion Opportunities Inventory [128] By using machine learning to process the image contents of a plurality of frames, to identify a candidate insertion zone, and a corresponding insertion zone descriptor comprising one or more types of candidate objects, the speed of identifying insertion zone opportunities and suitable objects for insertion can be increased significantly. In particular, an operator can directly review the candidate insertion zone and recommended object types for insertion, without having to do any analysis of the scene's content themselves. The one or more insertion zone descriptors can very quickly give an indication of what types of objects can be inserted into a scene (and optionally for how long they can be visible), at which point further investigation and / or object insertion can happen. For example, a source video can comprise eight different scenes, and one or more candidate insertion zones and corresponding descriptors can be returned for each. So, without any operator time or effort, it can be very quickly understood which scenes can be suitable for object insertion and what types of objects could be inserted in such scenes. Additional processing and / or operator time can then be focused on only the most promising scenes (for example, those in which inserted objects will be visible for the longest time and / or which are suitable for types of object that are of particular interest, such as types of objects that a director has indicated they would like to see inserted in the source video, etc.). Consequently, the increasing volume of video content being generated can be evaluated more quickly and the operator's time focused only on the most suitable scenes for object insertion.
Workflow [129] Figure 1 shows an exemplary schematic representation of a system 100 in accordance with an aspect of the present disclosure. The system 100 comprises a candidate insertion zone module 110, a scene detection module 120, a segmentation module 130, an object insertion module 140 and a database 150.
[130] Figure 2 shows an exemplary process performed by system 100 to identify at least one insertion zone for a scene from the source video, and to determine a corresponding insertion zone descriptor.
[131] In step S210, the scene detection module 120 obtains a source video. The source video can comprise one or more digital files, and the scene detection module 120 can obtain the source video, for example, through a high-speed computer network connection, the internet, or from a computer-readable hardware storage device. The source video comprises frames of video material, which can be grouped into scene shots or scenes if recorded by the same camera, or set in a particular location.
[132] Scene detection module 120 can perform pre-analysis on the source video to create a sequence of scene shots or similar scenes that may be suitable for object insertion. The pre-analysis can be completely automated, so that it does not involve any human intervention. Pre-analysis can include using a scene detection function to identify the boundaries between different scene shots in the source video. For example, the scene detection module 120 can automatically detect dry and smooth cuts between different scene shots, which correspond to dry and smooth transitions respectively. The dry cuts correspond to an abrupt change in the visual similarity between two consecutive frames in the source video. The smooth cuts correspond to the beginning or the end of a smooth transition (for example, wipe and cross-fade transitions), which can be characterized by a significant, but gradual, change in visual appearance across several frames. Other pre-analysis techniques known in the art can be employed, such as continuity detection, point tracking or plane tracking, 3D tracking, autokeying, region segmentation, etc.
[133] In Step S220, candidate insertion zone module 110 processes the contents of a plurality of frames in a scene identified by scene detection module 120. It will be noted at this point that while system 100 represented in Figure 1 comprises the scene detection module 120, scene detection module 120 is optional and in an alternative implementation the candidate insertion zone module 110 can receive a plurality of frames of a scene from the source video from an entity outside system 100, for example, via a high-speed computer network connection, the internet, or from a computer-readable hardware storage device.
[134] Candidate insertion zone module 110 processes the contents of a plurality of frames from a source video scene to identify one or more candidate insertion zones in the image content of the frames. The content of all of the plurality of frames in a scene can be processed, or a subset of the plurality of frames (for example, processing speeds can be increased by analyzing fewer than all of the plurality of frames, such as processing every second frame, or by analyzing the similarity of frames to identify groups of similar frames within a scene and processing only one or some, but not all, of the frames in each similar group, etc.). Each candidate insertion zone is suitable for inserting an object (or objects) in the image content of at least some of the scene's plurality of frames. Candidate insertion zone module 110 also determines an insertion zone descriptor for each of the identified candidate insertion zones. Each insertion zone descriptor comprises one or more types of candidate objects indicative of one or more types of object that are suitable for insertion into the corresponding candidate insertion zone (for example, the types of candidate objects may be indicative of a recommendation, or suggestion, or prediction of one or more types of object for insertion in the corresponding candidate insertion zone). The insertion zone descriptor may also comprise additional information indicative of the duration of the candidate insertion zone (for example, one or more of: the amount of time the candidate insertion zone is present during the scene, the size of the insertion zone, centrality in relation to the image, etc.). Additional details of the different ways in which candidate insertion zone module 110 can be configured to determine the insertion zone descriptor are explained later.
[135] A candidate insertion zone is a region of a scene's image content that is suitable for insertion of an object. As explained previously, a candidate insertion zone can correspond to a table in the scene's image content, which can be suitable for inserting any type of object that can be placed on a table, for example a lamp or a bottle of soda. Alternatively, a candidate insertion zone may correspond to a wall, which may be suitable for inserting a poster. Alternatively, a candidate insertion zone can correspond to an object in the scene, for example a coffee jar or vehicle, which may be suitable for inserting a branding change object, in order to change the object's branding in the scene.
[136] As explained above, the insertion zone descriptor can comprise information indicative of the duration of the insertion zone. The duration of a candidate insertion zone is the time for which the candidate insertion zone is present within the scene. As a non-limiting example, during a scene that lasts for 30 seconds, a character can open the door of a refrigerator, revealing a shelf inside the refrigerator that can be identified as a candidate insertion zone. Five seconds later, the character can close the refrigerator door. In this particular example, the duration of the candidate insertion zone is five seconds, since it is visible within the scene for only five seconds. The information in the insertion zone descriptor can be indicative of the duration of the candidate insertion zone in any suitable way, for example by indicating the time for which the insertion zone is present within the scene in units of hours, and / or minutes, and / or seconds, and / or milliseconds, etc., or by indicating the number of frames in the scene in which the insertion zone is present (from which the duration can be derived by using the source video frame rate), etc.
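A short worked example of the frame-count conversion just described (the numbers are illustrative):

```python
# Deriving the duration from the number of frames and the source video frame rate.
frames_present = 125        # frames in which the candidate insertion zone is visible
frame_rate = 25.0           # source video frame rate, in frames per second
duration_seconds = frames_present / frame_rate   # = 5.0 seconds, as in the refrigerator example
```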
[137] The one or more types of candidate objects can take any appropriate form depending on the particular implementation of the candidate insertion zone module 110 and / or the requirements of the owner / operator of system 100. For example, the one or more types of candidate objects may comprise particular categories of objects that can be inserted into the candidate insertion zone. An example list of 20 different categories of objects is given in Figure 3, from which the one or more types of candidate objects can be selected (for example, the candidate insertion zone can be a counter in a kitchen, and the one or more types of candidate objects can comprise Food; Soft drinks; Hot beverages. The one or more types of candidate objects may, additionally or alternatively, be indicative of particular candidate objects for insertion, for example Trademark X Soda Can; Trademark Y Coffee Bag; Trademark Z Kettle; etc.).
[138] After having identified the one or more candidate insertion zones and the corresponding one or more insertion zone descriptors, in Step S230, the candidate insertion zone module 110 can output the identification of the one or more candidate insertion zones and the one or more insertion zone descriptors from system 100. Additionally or alternatively, in Step S230, candidate insertion zone module 110 can pass the identification of the one or more candidate insertion zones and the insertion zone descriptors to segmentation module 130 and / or object insertion module 140.
[139] In optional step S240, segmentation module 130 selects a frame from the scene that includes the candidate insertion zone (for example, it can select any arbitrary frame that includes the candidate insertion zone, or the first frame in the scene that includes the candidate insertion zone, or the central frame in the scene that includes the candidate insertion zone, or the last frame in the scene that includes the candidate insertion zone, etc.) and superimposes a visualization of the candidate insertion zone on the selected frame, in order to create an insertion zone suggestion frame. The overlay of the visualization of the candidate insertion zone can be performed, for example, based on pixel marking, in which the candidate insertion zone module 110 has marked the pixels in the scene frames to identify each pixel as part, or not, of a candidate insertion zone, so that the segmentation module can readily identify the boundaries of any candidate insertion zones. The insertion zone suggestion frame may also comprise a visualization of the one or more types of candidate objects (for example, text superimposed on the frame that identifies the one or more types of candidate objects) and / or a visualization of any of the other information in the insertion zone descriptor (such as text superimposed on the frame that identifies the amount of time and / or number of frames for which the candidate insertion zone is present in the scene). The superimposed visualization of a candidate insertion zone can take the form of a colored area in the image content of a scene frame, whose borders correspond with the borders of the candidate insertion zone.
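A minimal sketch of the overlay step described above, assuming the candidate insertion zone is available as a boolean per-pixel mask (per the pixel marking mentioned); the colour and opacity are illustrative choices.

```python
# Assumed inputs: frame is an (H, W, 3) uint8 image, zone_mask an (H, W) boolean mask.
import numpy as np

def overlay_zone(frame, zone_mask, colour=(0, 255, 0), alpha=0.5):
    """Return an insertion zone suggestion frame with the zone tinted in colour."""
    out = frame.astype(float)
    out[zone_mask] = (1 - alpha) * out[zone_mask] + alpha * np.array(colour, dtype=float)
    return out.astype(np.uint8)
```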
[140] Figure 4 shows an exemplary frame from a scene 410 and an exemplary insertion zone suggestion frame 420, which is the same as the exemplary frame from scene 410, but with a visualization of a candidate insertion zone 425 superimposed on the scene. As will be noted, the insertion zone suggestion frame 420 can help an operator to quickly understand the characteristics and possibilities of the candidate insertion zone 425, for example, how prominent it can be within the scene, what type of objects may be suitable for insertion and / or for how long such objects can be visible within the scene, etc. Consequently, the speed with which an assessment of the potential value of the candidate insertion zone and subsequent object insertion can be made is considerably increased, since a source video or a scene from a source video can be input into system 100, and a readily intelligible representation of a candidate insertion zone (or zones) and object insertion opportunities within such candidate insertion zone (or zones) is quickly generated as output from system 100.
[141] In optional step S250, object insertion module 140 performs an operation similar to segmentation module 130, except that instead of generating an insertion zone suggestion frame 420, it generates an object insertion suggestion frame. This can be very similar to the insertion zone suggestion frame 420, but instead of superimposing a visualization of the candidate insertion zone, the object insertion suggestion frame can comprise a scene frame with an object inserted in the candidate insertion zone. In this way, a model of the insertion opportunity can be created.
[142] For this purpose, object insertion module 140 can be configured to select an object for insertion from database 150, which can comprise a library of object graphics for insertion, based on the one or more types of candidate objects, and insert the selected object in the frame. The object graphics library can be indexed by object type so that the object to be inserted can be any object that matches one or more of the types of candidate objects in the insertion zone descriptor (for example, if the insertion zone descriptor identifies Beverages, Soft drinks as a candidate object type, any soft drink object in database 150 can be selected and inserted in the frame to create the object insertion suggestion frame). Optionally, the object insertion module 140 can generate a plurality of different object insertion suggestion frames, each of which comprises a different object, so that the visual appearance of different objects inserted in the scene can be readily observed. Still optionally, instead of inserting a complete representation of the object, object insertion module 140 can insert a shape (for example, a colored box or cylinder, etc.) that roughly matches the shape of a generic object that matches the candidate object type. This can help with visualizing what the scene might look like after inserting an object, without limitation to a specific object that is within the candidate object type.
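As a sketch of the selection step, the snippet below models the object graphics library as a mapping indexed by object type; the file names and types shown are hypothetical, not taken from the disclosure.

```python
# Hypothetical, minimal model of a library of object graphics indexed by object type.
import random

object_library = {
    "Soft drinks": ["soda_can_A.png", "soda_bottle_B.png"],
    "Hot beverages": ["coffee_jar_C.png"],
}

def select_object(candidate_object_types, library=object_library):
    """Pick any graphic whose type matches one of the candidate object types."""
    for object_type in candidate_object_types:
        graphics = library.get(object_type)
        if graphics:
            return random.choice(graphics)
    return None   # no matching object type in the library
```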
[143] Figure 5 shows an exemplary frame from a scene 510 and an exemplary object insertion suggestion frame 520, which is the same as the exemplary frame from scene 510, but with a suggested object 525 inserted into the scene. Optionally, the object insertion suggestion frame 520 may also comprise a visualization of any other information in the insertion zone descriptor (for example, text superimposed on the frame that identifies the amount of time and / or number of frames for which the candidate insertion zone is present in the scene). It will be noted that the object insertion suggestion frame 520 can help an operator quickly see what the scene might look like with a suitable object inserted in it. Furthermore, if a determination is made to insert the object into the scene, it can help to speed up the insertion process, since the operator can understand very quickly how and where the object can be inserted.
[144] Based on the candidate insertion zone (or zones) and candidate object type (or types), and / or the object insertion suggestion frame and / or the insertion zone suggestion frame, one or more objects that correspond to the type (or types) of object indicated by the candidate object type can be inserted into the scene of the source video, so that they appear within the image content of the scene frames. For example, an operator can decide whether or not to proceed based on the candidate insertion zone (or zones) and candidate object type (or types), and / or the object insertion suggestion frame and / or the insertion zone suggestion frame. If they decide to proceed, an object (or objects) can be inserted according to any standard techniques known to those skilled in the art. If they decide not to proceed, nothing further need happen. Alternatively, the insertion of an object (or objects) of the type indicated by the candidate object type (or types) can happen automatically after the candidate insertion zone (or zones) and candidate object type (or types) have been determined.
[145] The candidate insertion zone module 110 uses machine learning techniques to perform at least some of the steps necessary to process the image contents of the plurality of frames in a scene to identify at least one candidate insertion zone in the scene and at least one corresponding insertion zone descriptor. There are several different ways in which candidate insertion zone module 110 can be configured to use machine learning for this purpose, which are summarized below as an indirect approach or a direct approach. The exemplary configurations of the candidate insertion zone module 110, according to each one of the indirect approach and the direct approach, are described below with reference to Figures 6 and 9.
Indirect Approach [146] Figure 6 shows an exemplary schematic representation of a configuration of the candidate insertion zone module 110 to perform the indirect approach of identifying a candidate insertion zone and determining an insertion zone descriptor. The candidate insertion zone module 110 comprises a scene descriptor sub-module 610, an identification sub-module 620, a database 630 comprising a library of insertion object types and a post-processing sub-module 640. The database 630 can be the same as database 150, or it can form part of database 150 (or database 150 can form part of database 630), or it can be completely separate from database 150.
Regional Context Descriptor
[147] The scene descriptor sub-module 610 is configured to process the image content of a plurality of scene frames by using machine learning to determine scene descriptors. Scene descriptors can comprise at least one regional context descriptor and / or at least one global context descriptor.
[148] A regional context descriptor can be indicative of what kind of thing a part of the image content of the plurality of frames is. For example, a part identified within the image can be semantically identified as belonging to any of the four regional context descriptor classifications: (1) Human, (2) Animal, (3) Surface, (4) Object. Where a part of an image has been identified as being part of one of the four regional context descriptor classifications, that part of the image can be characterized more precisely by the use of attributes associated with such a regional context descriptor classification.
[149] Figure 7a, for example, shows attributes that can be used to describe a detected Human more precisely. In this particular example, a Human can be described more precisely with two different types of attributes: Gender and Age. However, it will be noted that any number of other types of attributes can additionally or alternatively be used, for example: ethnicity, hair color, etc. Additionally or alternatively, attributes can identify particular actors or characters so that they can be tracked through scene shots of a sequence. To that end, one of a large number of readily available facial recognition packages can be used to identify characters and / or actors, using Fisher vectors, for example. Fisher vectors are described in K. Simonyan, A. Vedaldi, A. Zisserman, "Deep Fisher networks for large-scale image classification", Proc. NIPS, 2013.
[150] Figure 7b, for example, shows attributes that can be used to describe a detected Object more precisely. Again, these attributes are shown only as a non-limiting example, and any other suitable Object attributes can, additionally or alternatively, be used. In addition, although in this example an identified Object is described with only one type of Object attribute, it can alternatively be described using two or more different types of attributes, for example:
the object's category (such as beverage cans, magazine, car, etc.) and the object's trademark.
[151] Figure 7c, for example, shows attributes that can be used to describe a detected Surface more precisely. Again, these attributes are shown only as a non-limiting example and any other suitable Surface attributes can, additionally or alternatively, be used. Furthermore, although in this example an identified Surface is described with only one type of Surface attribute, it can, alternatively, be described using two or more different types of Surface attributes.
Pixel Marking to Determine Regional Context Descriptors [152] The machine learning sub-module 610 can be configured to determine the one or more regional context descriptors in any suitable manner. In a particular example, it can be configured to annotate each of the pixels in a plurality of scene frames (or each of at least some of the pixels in a plurality of scene frames, as explained in more detail later) with a regional context probability vector. While it may be preferable for each of at least some of the pixels to be annotated with its own regional context probability vector, for resolution reasons, in an alternative implementation each regional context probability vector can relate to a group of two or more pixels. For example, the pixels that make up a frame can be grouped into a series of subsets, with each subset comprising two or more pixels. In this case, each subset can be annotated with a regional context probability vector. Consequently, machine learning sub-module 610 can be configured to annotate at least some of the pixels (either individually or in subset groups) with regional context probability vectors. The regional context probability vector can comprise a probability value for each of a plurality of regional context markings, where each probability value is indicative of the chance that the type of entity indicated by the corresponding regional context mark is applicable to such a pixel (or pixels) (for example, the values in the regional context probability vector may be indicative of a relative 'score' for each of the markings, representing the relative chance of each of the markings being applicable to that pixel (or pixels)). A non-limiting example of a regional context probability vector for a pixel is as follows:
c = [0.1, 0.05, 0, 0.05, 0, 0.05, 0.05, 0.05, 0.4, 0.15, 0.1]
[153] Each of the items in vector c corresponds to a regional context mark, where each regional context mark is indicative of a different type of entity. In this particular example, the regional context markings are:
[154] Not a 'thing', Man under the age of 45, Man over the age of 45, Woman under the age of 45, Woman over the age of 45, Animal, Table Top, Kitchen Counter Top, Vehicle, Computer, Book
[155] Thus, each of the regional context markings for the pixel in this example has the following probability value:
[156] Not a 'thing' = 0.1
[157] Man under the age of 45 = 0.05
[158] Man over the age of 45 = 0
[159] Woman under the age of 45 = 0.05
[160] Woman over the age of 45 = 0
[161] Animal = 0.05
[162] Table top = 0.05
[163] Kitchen Counter Top = 0.05
[164] Vehicle = 0.4
[165] Computer = 0.15
[166] Book = 0.1
[167] So, it can be seen that there are four markings related to the Human classification (each such mark being an attribute related to Humans), one marking related to the Animal classification, two markings related to the Surface classification (each such mark being an attribute related to Surfaces) and three markings related to the Object classification (each such mark being an attribute related to Objects).
[168] The Not a 'thing' mark indicates the chance that the pixel does not belong to any one of the other regional context markings, that is, the pixel (or pixels) does not refer to anything. The probability of the Not a 'thing' mark can be set to: 1 minus the sum of all other probabilities in the regional context vector. Consequently, the sum of all probabilities in the regional context probability vector must be 1.
[169] Therefore, in this example, the regional context marking that has the probability value with the maximum argument (that is, the highest probability) is 'Vehicle'. Thus, the regional context marking considered most likely to be applicable to the pixel (or pixels) is 'Vehicle' (i.e., such pixel (or pixels) is considered to be most likely part of a vehicle).
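A short worked example reproducing the vector above: the Not a 'thing' entry is 1 minus the sum of the other probabilities, and the mark with the maximum argument is taken as most likely for the pixel. The numpy representation below is a sketch; the disclosure does not mandate any particular implementation.

```python
import numpy as np

marks = ["Not a 'thing'", "Man under 45", "Man over 45", "Woman under 45",
         "Woman over 45", "Animal", "Table top", "Kitchen counter top",
         "Vehicle", "Computer", "Book"]
entity_probs = np.array([0.05, 0.0, 0.05, 0.0, 0.05, 0.05, 0.05, 0.4, 0.15, 0.1])
c = np.concatenate([[1.0 - entity_probs.sum()], entity_probs])  # 'Not a thing' = 0.1
print(marks[int(c.argmax())])   # -> Vehicle
```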
[170] Although each of the probabilities in the regional context probability vector in this particular example is between 0 and 1, with larger values indicating greater chance, it will be noted that the regional context probability vector can take any other suitable form that is indicative of the chance that the type of entity indicated by the corresponding regional context marking is applicable to a pixel (or pixels). For example, a regional context probability vector may comprise probability values between 0 and 20, or between 10 and 20, or between 0 and 100, etc., where each value is indicative of the relative chance that the type of entity indicated by the corresponding regional context marking is applicable to a pixel (or pixels). It can therefore also be seen that the probabilities do not necessarily have to sum to 1, but may alternatively have any other suitable value as a result of the sum.
[171] Although, in the above, there is a particular example of a regional context probability vector, it will be noted that machine learning sub-module 610 can be configured to determine regional context probability vectors that comprise any number of probability values that correspond to regional context markings, for example 10000 or 10000 of probability values that correspond to 10000 or 10000 of regional context markings.
[172] By determining the regional context probability vectors for pixels in the frames, an understanding of what 'things' are in a frame's image content, and of their relative positioning, can be achieved. For example, a region of the frame where all pixels are annotated with regional context probability vectors whose maximum-argument probability values correspond to 'Animal' is likely to contain an animal. A different region of the frame in which all pixels have regional context probability vectors with maximum arguments that correspond to 'Table top' is likely to contain a table top. Because the position of each pixel in the frame is known, the proximity of the animal and the table top can also be known. Thus, it can be said that the image contents of the frame include a table top and an animal, and their proximity to each other is known.
[173] It will be noted that regional context probability vectors can not only be used to identify which 'things' are in a frame's image content, and their proximity to each other, but can also be used to determine how many 'things' are within the image content of the frame. For example, the total number of 'things' of any type can be determined and / or the total number of each different type of 'thing' can be determined (for example, the number of humans, the number of animals, the number of soda cans, etc.). This can be useful for a variety of purposes, such as determining a global context descriptor and / or determining candidate object types (as explained in more detail later).
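A minimal sketch of this counting idea follows, assuming a hypothetical H x W x K map of regional context probability vectors for one frame; adjacent pixels sharing the same maximum-argument marking are grouped into one 'thing':

import numpy as np
from scipy import ndimage

def count_things(c_map, background_index=0):
    """Count connected regions whose per-pixel argmax is not the Not a 'thing' marking."""
    labels = np.argmax(c_map, axis=-1)       # most likely marking at each pixel
    counts = {}
    for k in np.unique(labels):
        if k == background_index:
            continue
        _, n = ndimage.label(labels == k)    # group adjacent pixels with the same marking
        counts[int(k)] = int(n)              # number of 'things' of type k in the frame
    return counts

print(count_things(np.random.rand(224, 224, 11)))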
[174] In addition, the pixels that are identified by the regional context probability vectors as being part of a surface can be indicative of a candidate insertion zone. Likewise, the pixels identified by the regional context probability vectors as being part of an object can also be indicative of a candidate insertion zone (since the branding of the identified object, for example, can be changed by object insertion). Thus, regional context probability vectors can not only provide additional information about 'things' within the image content, but can also be used to identify potential insertion zones and their proximity to other 'things' identified in the image content.
Global Context Descriptor [175] A global context descriptor is indicative of a general context of the image contents of the plurality of frames. One or more different global context descriptors can be determined by the machine learning sub-module, each corresponding to a different global context classification. Non-limiting examples of global context classifications are: Location, Human Action, Demography, State of Mind, Time of Day, Season of the Year (for example, spring, summer, autumn, winter), climate, filming location, etc.
[176] Figure 7d, for example, shows a set of attributes that can be used to describe the location of a scene. In this example, 41 different types of location are listed, although it is noted that the scene descriptor sub-module 610 can be configured to determine a location context descriptor for a scene from a list of any number of different location attributes. In addition, although the list in Figure 7d identifies generic location attributes, more specific location attributes can additionally or alternatively be used; for example, private rooms or places that occur regularly within a film or television series can be location attributes, such as a room for a particular character, or a kitchen for a particular family, etc.
[177] The scene descriptor sub-module 610 can determine at least one global context descriptor by using machine learning in any appropriate manner. In one example, for at least one frame of a scene, the scene descriptor sub-module 610 can use machine learning to determine at least one global context probability vector. Each global context probability vector for a frame can correspond to a different global context descriptor classification (for example, Location, Mood, etc.) and can comprise a plurality of probabilities, each corresponding to a different global context marking (each global context marking being an attribute for the particular global context descriptor classification). Based on the example shown in Figure 7d, a global context probability vector that corresponds to the Location classification can comprise 41 probability values that correspond to the 41 different attributes listed in Figure 7d. The probability values in the global context probability vector are indicative of the chance that the different attributes listed are applicable to the scene. Each probability value can be between 0 and 1, or can take any other suitable form indicative of relative chance, for example values between 0 and 20, or between 10 and 20, or between 0 and 100, etc. The probabilities in each global context probability vector can optionally sum to 1, or to any other suitable value. The attribute with the highest corresponding probability for each global context probability vector can then be considered as the attribute that best describes the overall context of the scene. For example, if for a global context probability vector related to Location the maximum-argument probability corresponds to the attribute Urban street outdoors daytime, the global context descriptor can comprise Location {Urban street outdoors daytime}. If, for a global context probability vector related to Mood, the maximum-argument probability corresponds to the Happy attribute, the global context descriptor can also comprise Mood {Happy}, etc. Thus, the global context descriptor may comprise one or more global context probability vectors, and / or one or more attributes chosen for each type of global context (for example, Location {Urban street outdoors daytime}, Mood {Happy}, etc.).
[178] Global context descriptors can be determined by using machine learning to determine them directly by processing the image content of a plurality of frames, or by deriving them from regional context descriptors. For example, it may be possible to infer suitable attributes for one or more global context descriptors based on one or more regional context descriptors for a frame's image content. As an example, if we consider the following attributes identified for the Object, Surface and Human regional context classifications in the image content of a frame:
[179] Object {sink, bottle, cereal box}
[180] Surface {table, counter top, wall}
[181] Human {woman, widow}
[182] it can be inferred that a suitable attribute for the Location global context classification is kitchen.
[183] Likewise, as an additional example, if attributes of a regional context descriptor, such as road and bank, are determined, it can be inferred that a suitable attribute for the Location global context classification is outdoors. The number of objects identified within the image content of the frame, particularly the number of particular types of objects, can also be indicative of particular global context attributes.
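Purely to make the inference in paragraphs [179] to [183] concrete, the following sketch hard-codes one such rule; the disclosure itself would learn these correlations with machine learning, and the attribute names used here are illustrative assumptions:

def infer_location(objects, surfaces, humans):
    regional = set(objects) | set(surfaces) | set(humans)
    if {"sink", "cereal box", "counter top"} & regional:
        return "kitchen"        # kitchen-like objects/surfaces suggest a kitchen
    if "road" in regional:
        return "outdoors"       # a road suggests an outdoor location
    return "unknown"

print(infer_location({"sink", "bottle", "cereal box"},
                     {"table", "counter top", "wall"},
                     {"woman", "widow"}))   # -> kitchen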
[184] In addition to processing the image content of a plurality of frames in order to determine the scene descriptors, the machine learning sub-module 610 can optionally also process audio data corresponding to the frames. This can improve the reliability of the determination. For example, gunshots are usually perceived negatively, and can therefore provide strong signals for the attributes of the Human Action and / or State of Mind classifications of the global context descriptors. Likewise, laughter can provide a signal for the happiness attribute of the Mood global context descriptor, and screaming can provide a signal for the excitement attribute of the Mood global context descriptor, etc.
[185] Scene descriptors are passed on to the identification sub-module 620, which uses machine learning to identify one or more candidate insertion zones in the image content based on the scene descriptors, and to determine an insertion descriptor for each. They can be passed to the identification sub-module 620 in the form of a plurality of annotated frames that are annotated with the regional context probability vectors and / or global context probability vectors described above, and / or annotated with the scene descriptor (or descriptors) most relevant to the scene (for example, the global context attribute chosen for each type of global context, etc.).
[186] As explained earlier, the regional context probability vectors can be indicative of which parts of the image content could be insertion zones, for example regions that refer to a Surface or an Object. Through machine learning, the identification sub-module 620 may be able to identify which of these regions are best suited to be candidate insertion zones (for example, based on their size, their positioning within the frame, their positioning in relation to other 'things' within the frame identified by the regional context descriptors, etc.).
Demography Context Descriptor
[187] The identification sub-module 620 can also determine one or more candidate object types for the insertion zone descriptor of each candidate insertion zone. This can be determined, for example, based on at least the scene descriptors and a library of insertion object types stored in the database 630, which are contextually indexed. Thus, the candidate object types can be determined in a way that is most suitable for the scene, based on the global context properties and / or regional context properties of the scene.
[188] By way of example, the people appearing in the scene can be helpful in determining a suitable candidate object type for the scene. This may be because insertion objects generally relate to people in some way, so that some insertion objects may appear natural close to some types of people, but do not appear natural close to other types of people. For example, the general perception may be that children are more interested in toys, and adults are more interested in clothing or household items. Therefore, if the scene descriptor includes a regional context descriptor in the Human classification that identifies the child attribute, it may be more appropriate to recommend toys for insertion into the image contents of the frames. Consequently, the identification sub-module 620 can learn, through machine learning, which candidate object types indexed in the library with the context of children should be suitable for insertion into that scene.
[189] To consider another example, a soft drink manufacturer may have a range of different trademarks that are marketed to different categories of consumers. It is generally known that diet soft drinks tend to be marketed more strongly to women. The identification sub-module 620 can recognize, through machine learning, that the candidate insertion zone and the regional context descriptors and / or global context descriptors suggest that the insertion of a soft drink may be appropriate. For example, the scene descriptors may include a Kitchen location descriptor, a refrigerator shelf surface and a soft drink object near the candidate insertion zone in the refrigerator, in which case the identification sub-module 620 can search the contextually indexed library in the database 630 and identify that the insertion of a soft drink may be appropriate (candidate object type = soft drinks). This can be a very useful recommendation for object insertion. However, if the scene descriptor also identifies that the scene includes a woman, the search of the contextually indexed library can more specifically identify a trademark (or trademarks) of soft drinks that tend to be marketed more intensely to women, in which case the candidate object type can be adjusted to that particular trademark (or trademarks). In this case, the candidate object type is more specific and can therefore be more useful for subsequent analysis and / or object insertion.
[190] It can be seen, therefore, that scene descriptors can be correlated with different types of objects, and machine learning can be used to learn these correlations. For example, the links between the detected examples of regional context descriptors for Location {room}, Human {child}, and Surface {floor} probably mean that a toy / game insertion object type would be appropriate. An insertion object type for DIY furniture accessories or spirits / liquors is probably not appropriate.
Insertion Probability Vector
[191] The identification sub-module 620 can annotate each pixel in a plurality of frames of a scene with an insertion probability vector a. The insertion probability vector a can be very similar to the regional context probability vector c described above, in that it can have a plurality of probability values, all but one of which correspond to an object type. The remaining probability value may correspond to a not suitable for object insertion mark. Each of the probability values is indicative of the chance that the type of insertion indicated by the corresponding insertion mark is applicable to the pixel (for example, the values in the insertion probability vector can be indicative of a relative 'score' for each of the markings, representing the relative chance that each of the markings is applicable to that pixel).
[192] While it may be preferable that each of at least some of the pixels is annotated with an insertion probability vector, for resolution reasons, in an alternative, each insertion probability vector can be related to a group of two or more pixels. For example, the pixels that make up a frame can be grouped into a series of subsets, with each subset comprising two or more pixels. In this case, each subset can be annotated with an insertion probability vector. Consequently, the identification sub-module 620 can be
configured to annotate at least some of the pixels (either individually or in subset groups) with insertion probability vectors.
[193] The probability values in the insertion probability vector can take any appropriate form. For example, each can be a value between 0 and 1, 0 and 10, 20 and 40, or 0 and 200, etc., with higher values indicating a greater chance. The sum of the probabilities in the insertion probability vector a can be 1, or any other suitable value. If the insertion probability vector is configured to have probability values that add up to 1, the value for the not suitable for object insertion mark can be set to 1 minus the sum of all the other probability values. This annotation can be added to the annotated version of the plurality of frames previously received from the scene descriptor sub-module 610 (so that the plurality of frames includes both the scene descriptor and insertion descriptor annotations), or it can be added to a 'fresh' version of the frames (so that the plurality of frames includes only the insertion descriptor annotations). The annotated frames therefore indicate the candidate insertion zone within the image content of the frames, as well as the corresponding one or more candidate object types.
[194] Thus, a candidate insertion zone can be identified as an area within the image contents of the frame that comprises a plurality of pixels having insertion probability vectors whose maximum-argument probability value corresponds to a mark that is indicative of a particular type of object. That particular object type is the candidate object type for that candidate insertion zone.
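A minimal sketch of this extraction step follows, assuming a hypothetical H x W x K map of insertion probability vectors for one frame and assuming index 0 is the not suitable for object insertion mark; contiguous pixels whose argmax corresponds to the same object-type mark are grouped into one candidate insertion zone:

import numpy as np
from scipy import ndimage

NOT_SUITABLE = 0   # assumed index of the "not suitable for object insertion" mark

def candidate_zones(a_map, min_pixels=500):
    labels = np.argmax(a_map, axis=-1)           # most likely insertion mark per pixel
    zones = []
    for k in np.unique(labels):
        if k == NOT_SUITABLE:
            continue
        regions, n = ndimage.label(labels == k)  # contiguous pixels with the same mark
        for r in range(1, n + 1):
            mask = regions == r
            if mask.sum() >= min_pixels:         # discard tiny clusters
                zones.append((mask, int(k)))     # (zone mask, candidate object type index)
    return zones

print(len(candidate_zones(np.random.rand(224, 224, 5))))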
Visual Impact Score Modeling
[195] The post-processing sub-module 640 can receive the plurality of annotated frames in order to identify clusters of pixels that are annotated with insertion probability vectors where all the maximum arguments of the vectors correspond to the same mark (that is, the same candidate object type). It can also determine the size, location and / or duration of the candidate insertion zone in the same way. The post-processing sub-module 640 can thus generate, as output from the candidate insertion zone module 120, an indication of the one or more candidate object types for the identified insertion zone, and any other insertion zone descriptor information it has determined (for example, the size, location and / or duration of the insertion zone).
[196] Optionally, the post-processing module 640 can also be configured to determine a Video Impact Score (VIS) for one or more of the identified candidate insertion zones. The VIS can be included as one of the insertion zone descriptors, and can be used to assess the potential impact of the insertion zone on video viewers. The VIS can be a multiplier applied to the quality score of an object insertion opportunity to account for the highly variable nature of embedding objects in video content. The VIS can take any suitable form, for example a number placed on a scale, such as a number between 0 and approximately 2 (although the scale can be of any size and granularity). In practice, the VIS may not be allowed to be less than 1, and is usually between 1 and 2.
[197] The VIS for a candidate insertion zone can be calculated based on at least part of the insertion zone descriptor for the insertion zone, for example based on the duration of the candidate insertion zone and / or the size of the candidate insertion zone.
[198] A non-limiting technical example for determining the VIS is identified below. In this example, the VIS is based on the combination of an Exposure Score and a Context Score (although any other suitable function for determining the VIS from any one or more items of the insertion zone descriptor may be used). These two scores are a weighted combination of several parameters that include Trademark Relevance, Duration, Hero Status, Proximity and Amplification, as defined below.
[199] Consider the following:
[200] Calculate Video Impact Score (BETA):
VIS = ES + CS
ES = Exposure Score
CS = Context Score
[201] Calculate Exposure Score:
ES = W_D \cdot f(D) + W_S \cdot f(S) + W_A \cdot A
D = Qualifying Exposure Duration
S = Average Exposure Size
A = Amplification (0 if not amplified, 1 if amplified)
f(D) = Duration evaluation function
f(S) = Size evaluation function
[202] W = Weight
The Context Score (CS) is a
weighted combination of specific metrics for embedding objects (particularly brand objects) in video content, with a focus on providing an assessment depending on the fit between the object (or trademark) and the content.
[203] CS can be between 0 and approximately 2
(although the scale can be of any size and granularity).
[204] The main term for determining CS can be the Trademark Relevance, which is used to determine whether the trademark fits the context (for example, Vodka in a bar). If there is no Trademark Relevance, the score will be 0 and the CS will be 0. When there is Trademark Relevance, the Context Score is 1 or more, with the rest of the terms providing value boosts.
[205] Context Scoring can be performed as follows, although it is noted that where CS is used to determine VIS, CS can be determined in any other appropriate way (for example, by using only one or more among B, H and P identified below):
CS = B \cdot (1 + W_H \cdot H + W_P \cdot P)
B = Trademark Relevance (0 if no match, 1 if match)
H = Hero Status (0 if no match, 1 if match)
P = Proximity (0 if not touching, 1 if touching)
[206] Thus, it will be noted that a VIS can be determined for a candidate insertion zone in a new video, based on at least some of the insertion zone descriptors. The VIS for a candidate insertion zone can be a useful means of ranking candidate insertion zones, or of filtering out poorer candidate insertion zones, so that the number of candidate insertion zones for a new video that meet a particular video impact requirement (for example, that have a VIS greater than a threshold value) can be readily identified and the potential suitability of object insertion opportunities for the new video can be readily observed.
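A minimal sketch of this scoring scheme, in the spirit of paragraphs [199] to [205], is shown below; the weights, the duration and size evaluation functions and the exact combination used for CS are assumptions, since the disclosure leaves them open:

def exposure_score(duration_s, avg_size, amplified,
                   w_d=0.5, w_s=0.4, w_a=0.1):
    f_d = min(duration_s / 10.0, 1.0)   # assumed duration evaluation function f(D)
    f_s = min(avg_size, 1.0)            # assumed size evaluation function f(S)
    return w_d * f_d + w_s * f_s + w_a * (1.0 if amplified else 0.0)

def context_score(brand_relevance, hero_status, proximity, w_h=0.5, w_p=0.5):
    b = 1.0 if brand_relevance else 0.0
    # CS is 0 without Trademark Relevance, and 1 or more when it is present
    return b * (1.0 + w_h * (1.0 if hero_status else 0.0)
                    + w_p * (1.0 if proximity else 0.0))

def video_impact_score(duration_s, avg_size, amplified,
                       brand_relevance, hero_status, proximity):
    return (exposure_score(duration_s, avg_size, amplified)
            + context_score(brand_relevance, hero_status, proximity))

print(video_impact_score(8.0, 0.2, False, True, True, False))   # -> 1.98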
[207]
Alternatively, a post-processing module may not be used, in which case the identification sub-module
620 can simply generate annotated frames as output, so that any other modules or submodules within system 100 (for example, object insertion module 140), or external to the system
100, can process the annotations to recognize the candidate insertion zones and the corresponding insertion zone descriptors.
Modeling the Indirect Approach [208]
Before the direct approach is described, it is worth considering some additional details of how the scene descriptor sub-module 610 and the identification sub-module 620 can be implemented to perform machine learning and, in particular, how they can be trained. Preferably, in the indirect approach, we use Convolutional Neural Networks (CNN) for the recognition of scene descriptors and Support Vector Machines (SVM) for the recognition of insertion descriptors.
The Convolutional Neural Network: a Bio-inspired Mathematical Model
[209] CNNs can be used to recognize different scene descriptors. A CNN is a network of learning units called neurons. A CNN is used to sequentially transform the initial image content of the video frame into an interpretable feature map that summarizes the image.
[210] The CNN is biologically inspired by the feed-forward processing of visual information and the organization of layers of neurons in the visual cortex. Like the different areas of the visual cortex, neurons in a CNN are grouped in layers, each neuron within the same layer performing the same mathematical operation.
[211] Typically, a layer in a CNN can perform (1) a convolutional operation, or (2) an activation operation, or (3) a pooling operation, or (4) an inner product operation. The first layers of a CNN perform convolutional operations on the image with a 2D bank of convolution filters. They vaguely model the behavior of retinal cells and area V1 of the visual cortex, in the sense that they behave like Gabor filters and subsequently route signals to areas deeper in the visual cortex. A convolution filter also models the fact that adjacent retinal cells have overlapping receptive fields and respond similarly to an identical visual stimulus.
[212] So, like area V2 and other areas of the visual cortex, subsequent layers of CNN build higher-level resources by combining lower-level resources. However, caution is needed in the search for an analogy, because artificial neural networks do not exactly replicate the biological processes of learning visual concepts.
[213] In more detail, the scene descriptor sub-module 610 may need to be trained (1) to determine a global scene descriptor and (2) to determine regional context descriptors using pixel marking. In order to do this, the corpus of existing video material used for training must be annotated in a similar manner. In order to explain the training process in more detail, it may be useful to first introduce some definitions.
Definitions
[214] A CNN operates on tensors. By definition, a tensor is a multidimensional matrix, and is used to store and represent image data and the CNN's intermediate data transformations, often called feature maps.
[215] Thus, an image can be represented as a 3D tensor
x \in \mathbb{R}^{C \times H \times W}
[216] where C, H, W respectively denote the number of image channels, the image height and the image width. The RGB color value of the pixel (i, j) is the 3D vector
[x[1, i, j], x[2, i, j], x[3, i, j]]^{T}
[217] The output of a CNN depends on the visual recognition task. Some examples of output are provided below.
[218] · In the image classification task, for example of determining the Location global context descriptor for a given image x, the final output of a CNN is a probability vector
y = \mathrm{CNN}(x)
[219] where the k-th coefficient y[k] quantifies the probability that the image corresponds to class k, say the locality Kitchen, and the best Location descriptor for image x is determined as
k^{*} = \operatorname{argmax}_{k} y[k]
[220] · In the image segmentation task, for example of determining the regional context descriptors, the final output of a CNN is a 3D tensor of probability vectors, where each coefficient y[k, i, j] quantifies the probability that the image pixel (i, j) corresponds to class k, say a 'Table' pixel. Thus, the best pixel mark is determined as the tensor defined by
k^{*}[i, j] = \operatorname{argmax}_{k} y[k, i, j]
[221] The dimensionality of the tensors does not really matter, as the layers can operate on tensors of any dimension. When dealing with video data as input, CNNs are sometimes referred to as video networks in the computer vision literature. In practice, it is sufficient and computationally more efficient to just use image data and exploit temporal coherence using Long Short-Term Memory (LSTM) networks. In particular, these are designed to handle an indefinitely long sequence of data.
[222] Furthermore, in practice, it is more efficient to feed images to a CNN in batches than one image at a time. A batch of N images can be represented by a 4D tensor
x \in \mathbb{R}^{N \times C \times H \times W}
[223] For video data, a batch of videos is a 5D tensor
x \in \mathbb{R}^{N \times T \times C \times H \times W}
[224] In the sequel, we restrict the description to image data and let the reader exercise the generalization of the subsequent definitions to video data. As shown above, a CNN is made up of interconnected layers. A layer is a differentiable function. Differentiability is a central property of a CNN, as it is a necessary condition for back-propagating gradients during the training stage.
[225] As another analogy, with physics, a CNN can be considered an electrical network in which tensors can be considered as electrical input or output signals, and a layer is an electrical component that filters electrical signals arriving from incident layers.
[226] Definition 1 We define a convolutional neural network (CNN) as a directed acyclic graph G = (V, E) where each node v \in V is a layer.
[227] Classic CNNs that are successful in image classification tasks are typically a chain of layers. Let's define the convolutional layer which is the most important layer used in a CNN.
[228] Definition 2 Let k be a kernel tensor in \mathbb{R}^{N' \times C' \times H' \times W'}. The convolutional layer with k is defined as the function that transforms an input tensor x \in \mathbb{R}^{N \times C \times H \times W} (for example, a batch of images) into a tensor x * k \in \mathbb{R}^{N \times N' \times H \times W}.
[229] In words, the kernel tensor k encodes the N' convolutional filters (that is, N' convolutional neurons) and, as a crude simplification, the convolutional layer can be seen as a kind of local averaging operation applied to all patches of size C' x H' x W' of each image x[n]. Each feature vector y[n, ., i, j] is a vector of dimension N' that describes the pixel x[n, ., i, j] of the n-th image x[n].
[230] In the sequel, the n-th image is also denoted by x_n \in \mathbb{R}^{C \times H \times W} to lighten the notation.
[231] An important observation is that a convolution operation is equivalent to a simple matrix-matrix product, which is how the popular deep learning packages implement it. Specifically,
[232] 1. by forming a matrix \varphi(x) of shape HW \times C'H'W', where each row W i + j encodes the image patch of shape C' \times H' \times W' centered on pixel (i, j); and
[233] 2. by reshaping the kernel tensor k into a matrix K of size C'H'W' \times N'
K = [\mathrm{vec}(k_1); \ldots; \mathrm{vec}(k_{N'})],
[234] we then see that
[235] Property 1 the tensor convolution is equivalent to the matrix-matrix product
x_n * k = \varphi(x_n) K
[236] and the derivative of the convolution w.r.t. the kernel tensor k is
\frac{\partial (x_n * k)}{\partial k} = \varphi(x_n)
[237] Thus, the tensor convolution of a batch of N images x with the kernel k consists of the application of N matrix-matrix products, which efficient linear algebra packages implement very efficiently. Note that the \varphi function can be implemented with the well-known im2col function in MATLAB or Python.
[238] At each iteration of the training stage, the gradient of the tensor convolution is calculated to update the kernel weights k, and is propagated back to the previous layers by the chain rule.
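A minimal NumPy sketch of Property 1 (the im2col construction followed by a matrix-matrix product) is shown below; the zero padding and the 'same' output size are assumptions made to keep the example short:

import numpy as np

def im2col(x, kh, kw):
    # x: C x H x W image; each row of phi(x) is the C*kh*kw patch centered on pixel (i, j)
    C, H, W = x.shape
    ph, pw = kh // 2, kw // 2
    xp = np.pad(x, ((0, 0), (ph, ph), (pw, pw)))
    rows = [xp[:, i:i + kh, j:j + kw].ravel() for i in range(H) for j in range(W)]
    return np.stack(rows)                       # shape (H*W) x (C*kh*kw)

def conv_as_matmul(x, k):
    # k: N' x C x kh x kw kernel tensor; x * k becomes phi(x) @ K (Property 1)
    n_out, _, kh, kw = k.shape
    K = k.reshape(n_out, -1).T                  # (C*kh*kw) x N'
    y = im2col(x, kh, kw) @ K                   # (H*W) x N'
    return y.T.reshape(n_out, x.shape[1], x.shape[2])

y = conv_as_matmul(np.random.rand(3, 8, 8), np.random.rand(4, 3, 3, 3))
print(y.shape)   # (4, 8, 8): N' = 4 feature channels at every pixel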
[239] Next, a global scene probability vector is defined.
[240] Definition 3 A global scene probability vector is defined as a vector of arbitrary dimension, where the k-th entry of the vector is the confidence value for an attribute of a single global context descriptor classification.
[241] For example, the vector entries may correspond to the descriptors 'Kitchen', 'Living room', Location 'Urban' and so on.
[242] To identify the regional context descriptors at each pixel, it is assumed that we have a training set of images x_n, in which each pixel x_n[., i, j] is annotated with a probability vector y_n[., i, j]. This leads us to define a regional scene probability tensor.
[243] Definition 4 A regional context probability tensor c is defined as a tensor of probability vectors in [0, 1]^{N \times C' \times H \times W}, in which c[n, k, i, j] quantifies a confidence value for the k-th regional descriptor at each pixel x_n[., i, j].
[244] Note that the regional context probability tensor has the same width and height as the image tensor x. Only the depth of the tensor is different.
[245] Multi-objective Loss Function and Weight Sharing. A dedicated CNN can be trained to predict each type of global context descriptor (location, mood and so on). Classically, the training stage is formulated as a parameter estimation problem. For this purpose, a differentiable loss function l(CNN(x), y) is necessary to measure the error between the estimated probability vector CNN(x) and the reference value probability vector y, where each entry y[k] is 0 everywhere except at some index k where the value is 1.
[246] Then, the training stage minimizes the sum of errors over all the training data (x_i, y_i), i = 1, ..., N:
\min_{k_v, v \in V} \sum_{i=1}^{N} l(\mathrm{CNN}(x_i; k_1, \ldots, k_{|V|}), y_i)
[247] with respect to the parameters k_v of each layer v that makes up the CNN. The objective function is differentiable w.r.t. the parameters k_v, v \in V, and the stochastic gradient descent method incrementally updates the parameters k_v, v \in V, by feeding batches of images.
[248] The CNNs can be trained together in a computationally efficient manner, in terms of speed and memory consumption, as follows. First, we let them share the same convolutional layers. Only the last layers differ, so that each CNN learns a specific global scene descriptor. Second, we define a multi-objective loss function as a (possibly weighted) sum of all the errors
l(\mathrm{CNN}_1(x), c_1, \ldots, \mathrm{CNN}_K(x), c_K) = \sum_{k=1}^{K} l_k(\mathrm{CNN}_k(x), c_k)
[249] Each CNN_k corresponds to the location estimator, the mood estimator, and so on. They are applied to the image tensor x to estimate a global scene probability vector or a regional probability tensor CNN_k(x). Each loss function l_k evaluates the distance between the estimated tensor CNN_k(x) and the reference value tensor c_k. Thus, during the training stage, back-propagating the errors of the multi-objective loss function allows the weights of the shared convolutional layers to become optimal for all the classification tasks.
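A minimal sketch of this weight-sharing arrangement, written in PyTorch purely for illustration (the disclosure does not name a framework, and the head sizes are assumptions), is:

import torch
import torch.nn as nn
import torch.nn.functional as F

class SharedTrunkCNN(nn.Module):
    def __init__(self, n_location=41, n_mood=8):   # class counts are assumptions
        super().__init__()
        self.trunk = nn.Sequential(                 # shared convolutional layers
            nn.Conv2d(3, 64, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(64, 128, 3, padding=1), nn.ReLU(), nn.AdaptiveAvgPool2d(1),
            nn.Flatten())
        self.location_head = nn.Linear(128, n_location)   # CNN_1: Location estimator
        self.mood_head = nn.Linear(128, n_mood)           # CNN_2: Mood estimator

    def forward(self, x):
        f = self.trunk(x)
        return self.location_head(f), self.mood_head(f)

model = SharedTrunkCNN()
x = torch.randn(4, 3, 224, 224)
loc_true = torch.randint(0, 41, (4,))
mood_true = torch.randint(0, 8, (4,))
loc_pred, mood_pred = model(x)
# multi-objective loss: sum of the per-task errors, back-propagated through the shared trunk
loss = F.cross_entropy(loc_pred, loc_true) + F.cross_entropy(mood_pred, mood_true)
loss.backward()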
[250] As with the regional context probability tensor, we define the insertion probability tensor as follows.
[251] Definition 5 An insertion probability tensor a is defined as a tensor of probability vectors in [0, 1]^{N \times C' \times H \times W}, where a[n, k, i, j] quantifies a confidence value for an insertion descriptor class.
[252] The insertion probability vector may simply encode the insertion type of the insertion object, for example vehicle, soda bottle, cell phone, etc., or not suitable for object insertion. Each entry a_n[., i, j] encodes the confidence value that, for example, the pixel x_n[., i, j] is:
[253] k = 1: not suitable for advertising object insertion,
[254] k = 2: suitable for inserting a product placement object of type vehicle,
[255] k = 3: suitable for inserting a signage placement object of type soda bottle,
[256] k = 4: suitable for inserting an object of type cell phone.
[257] And so on.
[258] It will be noted that this is just a particular example of the types of objects that can be identified in the insertion probability vector, and that any number of additional or alternative object types can be identified in the insertion probability vector.
[259] The above definitions helped explain how the training image corpus can be annotated and, consequently, how a trained machine learning system can annotate the plurality of frames of the source video (for example, the scene descriptor sub-module 610 can be trained to annotate a global context probability vector and / or a regional context probability vector (or vectors) for each pixel of the frames, in the manner described above in relation to the scene descriptor probability vectors, and the identification sub-module 620 can be trained to annotate each pixel of the frames with an insertion probability vector as described above). Therefore, we will now briefly describe the ways in which the machine learning training can be carried out.
Interpreting Feature Maps in the Recognition of Global Scene Descriptors
[260] We show below VGG-16, an example of a CNN architecture used for image classification into 1000 classes.
[261] Figure 11 shows the intermediate CNN results at different stages after feeding in an image. In this particular classification task, the CNN input is an image and is represented as a 3D volume of width 224, height 224 and depth 3.
[262] The output of the softmax block is a 1000D probability vector.
[263] The computational flow in a CNN is as follows:
[264] · The image is first transformed into a 224x224x64 feature map after the first convolution + ReLU block. The feature map describes each pixel (i, j) \in [1, 224] \times [1, 224] of the image with a 64D feature vector as a result of 64 different convolutional kernels.
[265] · The first 224x224x64 feature map is transformed into a second 224x224x64 feature map after a second convolution + ReLU block. Again, the second feature map describes each pixel (i, j) \in [1, 224] \times [1, 224] of the image with a 64D feature vector as a result of 64 different convolutional kernels.
[266] · The second feature map is then transformed by a max-pooling layer into a third feature map 112x112x64. The feature map can be interpreted as a grid of 112x112 image blocks. Each block (i, j) corresponds to a 2x2 pixel non-overlapping image fragment (i, j) in the original image. Each block is described by a 64D feature vector (not 128D, as the figure could be misleading).
[267] · The third feature map is then transformed by a convolution + ReLU block into a fourth feature map 112x112x128. Each block (i, j) corresponds to a 2x2 pixel non-overlapping image fragment (i, j) in the original image, and is described by a 128D feature vector as a result of 128 convolution kernels.
[268] · And so on; the reader will appreciate how the remaining feature maps are generated following the above reasoning.
[269] Consequently, we easily understand that a CNN creates a multi-scale representation due to the max-pooling operation. In the case of VGG-16, it can be seen, namely at the output of each max-pooling layer, that the image is represented successively as:
[270] · a grid of 112x112 image blocks, each block describing a 2x2 pixel non-overlapping image fragment of the original image with a 64D vector;
[271] · a grid of 56x56 image blocks, each block describing a 4x4 pixel non-overlapping image fragment with a 256D feature vector;
[272] · a grid of 28x28 image blocks, each block describing an 8x8 pixel non-overlapping image fragment with a 512D feature vector;
[273] · a grid of 14x14 image blocks, each block describing a 16x16 pixel non-overlapping image fragment with a 512D feature vector.
[274] After that, the coarser grid of 14x14 image blocks is eventually transformed into a 1000D probability vector by the last layers, which are composed of inner product, dropout and softmax layers and form what is called a perceptron network.
Recognition of Regional Context Descriptors
[275] To calculate a specific regional context probability vector, the original architecture of VGG-16 is not directly suitable for pixel classification. However, we noted earlier that VGG-16 builds a multi-scale (or pyramidal) representation of the input image. As a first approach, each pixel of the original image can be described by concatenating the feature vectors at each level of the pyramid.
[276] Intuitively, the single color value of a pixel is not always sufficient, even if it corresponds to a skin pixel, because skin color is not uniform. However, if we analyze the average color of neighboring pixels with a variable neighborhood size, it becomes increasingly easy for the CNN model to infer that the pixel is in fact a skin pixel.
[277] Fully convolutional networks and variant networks exploit and refine this intuition with deconvolutional layers.
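A minimal sketch of the 'concatenate feature vectors across pyramid levels' idea, written in PyTorch for illustration (the feature map sizes are taken from the VGG-16 discussion above, and the bilinear upsampling is an assumption), is:

import torch
import torch.nn.functional as F

def hypercolumns(feature_maps, out_hw):
    # feature_maps: list of 1 x C_l x H_l x W_l tensors from successive CNN layers
    ups = [F.interpolate(f, size=out_hw, mode='bilinear', align_corners=False)
           for f in feature_maps]
    return torch.cat(ups, dim=1)               # 1 x (sum of C_l) x H x W

fmaps = [torch.randn(1, 64, 112, 112), torch.randn(1, 256, 56, 56),
         torch.randn(1, 512, 28, 28)]
print(hypercolumns(fmaps, (224, 224)).shape)   # every pixel now has an 832D descriptor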
Human Action Recognition Through LSTM Network [278] It is convenient to describe human activity by means of a sentence and an LSTM is designed to predict a sequence of words. To allow the machine to predict such a sequence of words, just replace the perceptron network with an LSTM network. Unlike the usual layers, the LSTM maintains a state, encoded by a cell's state vector. This vector can be thought of as a 'memory' built continuously from past predictions, and this is an aspect of the LSTM that guarantees the temporal coherence of the predictions.
[279] The LSTM is updated by a set of transition matrices and weight matrices. These matrices are the parameters optimized during the training stage. One of the functions of these matrices is to update the cell state vector (memory) by properly considering the importance of the new prediction. We will not detail the mathematical mechanisms of the LSTM network further; the reader should only understand that an LSTM is just another differentiable function. Thus, the usual stochastic gradient method works as normal during the training stage.
[280] Experimentally, such a network using VGG-16 + LSTM has shown impressive results in automatic image captioning.
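A minimal sketch of this arrangement, in PyTorch for illustration (the feature dimension, hidden size and vocabulary size are assumptions), feeds per-frame CNN features to an LSTM whose hidden state carries the 'memory' across frames:

import torch
import torch.nn as nn

cnn_feature_dim, hidden_dim, vocab_size = 512, 256, 1000   # sizes are assumptions
lstm = nn.LSTM(cnn_feature_dim, hidden_dim, batch_first=True)
word_head = nn.Linear(hidden_dim, vocab_size)              # replaces the perceptron network

frame_features = torch.randn(1, 16, cnn_feature_dim)       # CNN features for 16 frames
outputs, _ = lstm(frame_features)                          # hidden state carries the 'memory'
word_logits = word_head(outputs)                           # one word prediction per time step
print(word_logits.shape)                                   # torch.Size([1, 16, 1000])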
Recognition of Insertion Descriptors [281] To recognize insertion descriptors, we use an SVM-based approach. An SVM is a useful classification algorithm to predict whether an object belongs to a certain class and can be used in supervised learning applications. An SVM-based classifier can only perform a binary classification. Although it may seem like a limitation, it can be generalized for a robust multiclass classification as follows.
[282] In the indirect approach, we train a dedicated SVM classifier for each brand category class, for example Kitchen Utensils, using a one-against-all strategy, where the training data is composed of positive samples, that is, images relevant to kitchen utensils, and negative samples, that is, images irrelevant to kitchen utensils.
[283] After the training stage, each class-specific classifier calculates a prediction score for a new, unseen image. It should provide a positive score when the image is suitable for that brand category and a negative score when it is not. The higher the score, the more suitable the image is for the brand category. It is then possible to establish a ranking of brand categories. An advantage of using an SVM instead of a CNN is that we can learn to recognize a new brand category without having to start the learning process from scratch. Another advantage is that an SVM behaves better than a CNN where the classes are not mutually exclusive. For the brand category classification problem, a scene may be suitable for many brand categories. However, unlike a CNN, an SVM is unable to learn how to transform image data into an efficient feature vector. Instead, an SVM requires the feature representation in advance to ensure good prediction results for the recognition task in question.
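A minimal sketch of one such class-specific classifier, assuming scikit-learn and using random stand-ins for the CNN feature vectors, is:

import numpy as np
from sklearn.svm import LinearSVC

rng = np.random.default_rng(0)
features = rng.normal(size=(200, 512))               # stand-ins for CNN feature vectors
is_kitchen_utensils = rng.integers(0, 2, size=200)   # 1 = relevant image, 0 = irrelevant image

clf = LinearSVC(C=1.0)                               # one dedicated classifier per brand category
clf.fit(features, is_kitchen_utensils)

new_image_feature = rng.normal(size=(1, 512))
score = clf.decision_function(new_image_feature)[0]  # positive = suitable for the category
print("Kitchen Utensils score:", score)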
Semi-supervised learning for less labor-intensive annotation
[284] There are a few ways to train learning systems. The easiest, but most laborious, approach is the supervised learning approach, where each training sample must be fully annotated. In particular, for the prediction of the regional context descriptor, each pixel of the image can be annotated. The more difficult but less labor-intensive approach is the semi-supervised learning approach.
[285] Obtaining annotations for each training video is an expensive and time-consuming task. In practice, it may be more efficient not to annotate all pixels for the regional context vector and instead provide a number of annotations that is not necessarily complete, but sufficient.
[286] In particular, we may allow the training data to contain vaguely or partially annotated video shots, for example bounding boxes or scribbles. Semi-supervised learning algorithms address these problems.
Temporal coherence using LSTM
[287] Video Networks. Sub-module 610 can be extended to operate on video data instead of image frames, since most convolutional layers generalize to video tensors. However, video networks are not practical. More importantly, this raises the question of subsampling the video data across the time dimension, which potentially means losing information and a drop in accuracy in the prediction task.
[288] LSTM and Variants. Instead, it is more efficient, in practice, to use an LSTM network in place of a perceptron network to ensure temporal coherence. The LSTM remains applicable to location detection, mood detection, regional context descriptor prediction and blue box prediction, as it simply means replacing the perceptron network with an LSTM network on each corresponding CNN. Note that there are numerous variant methods that use the same principle as the LSTM in semantic segmentation tasks; let us mention, for example, the clockwork approaches.
[289] Figure 8 shows exemplary steps of a process 800 for training the machine learning of the scene descriptor sub-module 610 and the identification sub-module 620, in order to determine scene descriptors as described above. The scene descriptor sub-module 610 can comprise a CNN which, in step S802, is provided with a corpus of training images that are annotated with scene descriptor probability vectors as described above. This provides the CNN with the means to learn the image features that can be associated with scene descriptors, as described above. The last layers of the CNN can be used to extract generic visual recognition features related to regional context descriptors and / or global context descriptors. The neural network model that comprises weights and predictions for scene descriptors is developed in step S804. The identification sub-module can comprise an SVM model to identify candidate insertion zones, and an additional CNN / SVM model to determine the corresponding candidate object types. The neural network model comprising weights and predictions can be provided to the identification sub-module 620 to train an SVM to predict the scene descriptors most useful for determining the candidate object types in step S808. Before providing the generic visual recognition features created from activations in the last layers of the CNN to the SVM, several pre-processing steps S806 can be implemented to refine the process at the SVM stage. These can include L2-normalizing the features, or combining the features from different fragments of the image.
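A minimal sketch of that pre-processing step (L2 normalization of CNN feature vectors, followed by combining the fragments of one image) is shown below; the feature dimension is an assumption:

import numpy as np

def l2_normalize(features, eps=1e-12):
    # features: M x D matrix of CNN activations, one row per image or fragment
    norms = np.linalg.norm(features, axis=1, keepdims=True)
    return features / np.maximum(norms, eps)

fragments = np.random.rand(5, 4096)                      # features from 5 fragments of one image
image_feature = l2_normalize(fragments).mean(axis=0)     # combined descriptor fed to the SVM
print(image_feature.shape)                               # (4096,)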
[290] After having trained the scene descriptor sub-module 610 and the identification sub-module 620, they can then process the image contents of the plurality of frames of the source video as follows:
[291] 1. a CNN-based model in the scene descriptor sub-module 610 generates, for the scene, a heat map for each scene descriptor (for example, by determining a regional context probability vector for each pixel in the plurality of frames, in which case a 2D heat map of regional context probability would be generated for each frame, with the temporal element coming from the heat maps across the plurality of frames);
[292] 2. an SVM model in the identification sub-module 620 then identifies the candidate insertion zones within the image contents based on the scene descriptors;
[293] 3. an additional CNN / SVM model in the identification sub-module 620 then determines the corresponding insertion descriptors for each candidate insertion zone.
Direct approach [294] As explained above in relation to the indirect approach, there may be a correlation between particular scene descriptors and types of objects that are suitable for insertion into the scene. However, it was noted that in some cases, different scene descriptors can be orthogonal for two reasons:
[295] · For example, consider placing a bottle of wine on a dinner table. From a purely contextual point of view, the table-bottle association would seem more correct than the wall-bottle association. Therefore, all pixels on the table can be considered more relevant than a pixel on the wall for a wine bottle placement. Consequently, a correlation between table and bottle can be inferred, whereas a correlation between wall and bottle cannot.
[296] From the point of view of a content analyst or an incorporation artist, however, this may be slightly more subtle. First, due to the 3D geometry, the placed bottle will need to occupy at least a few pixels of the table and possibly a few pixels of the wall. Second, not all pixels on the table have the same impact on the insertion of the object: if a character is sitting at a dining table, it can have more impact to insert the bottle next to the character's hand, rather than at the other end of the table.
[297] · Statistical Properties of Learning of Insertion Zones. Our data shows that the insertion zones identified by content analysts as suitable for object insertion often depend on their position in relation to other things in the image content. For example, analysts may choose insertion zones that are the parts of a table that are close to the arms and hands of characters. Likewise, signage opportunities can often be of the outdoor building wall type rather than indoor walls.
[298] In addition, specific types of objects relevant to different types of surfaces, for example, table top, work surface and bar counter, can be learned together.
[299] These two observations have a non-trivial consequence. Although the scene descriptors described above in relation to the indirect approach can be very useful, they may not be strictly necessary to identify candidate insertion zones and to determine candidate object types that are suitable for insertion into the candidate insertion zone. A machine learning system, for example one that uses Deep Neural Networks, may be able to capture these statistical properties of insertion zones and therefore simultaneously identify candidate insertion zones and determine candidate object types for the identified candidate insertion zones. This is called, in the present disclosure, the direct approach, since machine learning is used to identify and determine candidate insertion zones and candidate object types directly, in a single step, from the processing of the image content of the plurality of frames (in contrast to the indirect approach, in which the image contents of the plurality of frames are first processed using machine learning to determine scene descriptors, and the candidate insertion zones and candidate object types are then determined in a second machine learning stage from the scene descriptors).
[300] Figure 9 shows an exemplary schematic representation of a configuration of the candidate insertion zone module 110 for performing the direct approach. As can be seen, the insertion zone and insertion object identification sub-module 910 receives the plurality of frames of a scene and processes the image contents of the frames to identify the candidate insertion zone (or zones) and one or more candidate object types.
[301] The insertion zone and insertion object identification sub-module 910 may comprise a CNN model that can be trained in a similar manner to that described above. In this way, the insertion zone and insertion object identification sub-module 910 may be able to learn what kind of image characteristics (for example, types of scene descriptors, relative positioning of regional context descriptors) can determine the size and positioning of insertion zones, and, in turn, can lend themselves to the insertion of particular types of objects. Since in the training corpus objects would typically have been inserted into the image content for particular reasons, for example types of particular objects would have been inserted into the image due to the fact that they fit well with the rest of the image content, and / or objects can be inserted closer to particular characters, in order to increase the impact of the inserted object (as explained above), the
insertion zone and insertion object identification sub-module 910 should inherently learn this insertion behavior from the training corpus. Consequently, when the trained insertion zone and insertion object identification sub-module 910 processes the plurality of frames of a new source video, it can naturally identify candidate insertion zones that should lie in the best regions of the image content (for example, in the table and wall pixels next to a character's hand for inserting a bottle of wine, rather than in the table pixels distant from a character's hand, as described earlier in the 'indirect' approach section).
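A minimal sketch of the direct approach, in PyTorch for illustration (the layer sizes and the number of insertion marks K are assumptions), maps a frame directly to a per-pixel insertion probability tensor in a single step:

import torch
import torch.nn as nn

K = 5   # assumed number of insertion marks, e.g. not suitable, vehicle, soda bottle, ...
direct_net = nn.Sequential(
    nn.Conv2d(3, 64, 3, padding=1), nn.ReLU(),
    nn.Conv2d(64, 64, 3, padding=1), nn.ReLU(),
    nn.Conv2d(64, K, 1),             # per-pixel logits over the insertion marks
    nn.Softmax(dim=1))               # an insertion probability vector at every pixel

a = direct_net(torch.randn(1, 3, 224, 224))
print(a.shape, float(a[0, :, 0, 0].sum()))   # torch.Size([1, 5, 224, 224]) 1.0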
[302]
Similar to the identification submodule
620 described above, the insertion zone and the insertion object identification sub-module 910 can output an annotated version of the plurality of frames, the annotations comprising an insertion probability vector for each pixel. The post-processing submodule
920 can be configured to operate in the same way as the post-processing sub-module 640 described above, and to output an identification of the candidate insertion zone and the corresponding insertion descriptor as described previously.
However, the post-processing sub-module 920 is optional, and in an alternative the candidate insertion zone module 110 can simply generate as output the plurality of annotated frames generated by the insertion zone and insertion object identification sub-module 910.
[303] In the direct and indirect implementations described above, training of the machine learning modules is performed using a corpus of training images that are annotated with scene descriptors and insertion descriptors. However, in some instances, a sufficiently large body of training material comprising these annotations may not be available. For example, there may be a large corpus of images that have been annotated by a content analyst or incorporation artist with insertion descriptors, but not with any scene descriptors, since the content analyst or incorporation artist may have been concerned only with the objects inserted into those images. In this case, the direct approach can still be effective, since it can still learn, implicitly, the different characteristics of the images that led the content analyst or incorporation artist to choose the insertion zone and insertion object they chose. However, it may still be preferable for the machine learning module to learn how to recognize scene descriptors for images in order to additionally improve its identification of candidate insertion zones and
determination of types of candidate objects. In this case, where a training corpus comprising only insertion descriptors is available, other trained machine learning modules can be used as part of the training process.
[304] Figure 10 shows an example representation of a training system that comprises a trained machine learning module 1010 and a machine learning module to be trained 1020. The machine learning module to be trained 1020 can be the scene descriptor sub-module 610 and the identification sub-module 620 of the above indirect approach, or the insertion zone and insertion object identification sub-module 910 of the above direct approach. In this example, a training corpus annotated with insertion zone descriptors is available. This is fed to both the trained machine learning module 1010 and the machine learning module to be trained 1020. The trained machine learning module 1010 can be trained to identify scene descriptors (for example, it can be trained to perform regional context recognition), so that it can identify scene descriptors for the image training corpus and feed them to the machine learning module to be trained 1020 (for example, as scene descriptor probability vector annotations of the image training corpus). Thus, the machine learning module to be trained 1020 can still be trained to operate as previously described using an image training corpus that lacks scene descriptors, by making use of an existing trained machine learning module 1010.
[305] Optionally, for both the direct and the indirect approach described above, an operator or user can provide feedback on the identified candidate insertion zone and / or insertion zone descriptor to the candidate insertion zone module 110. This optional implementation is shown in Figure 12, which is very similar to Figure 1, but includes additional user / operator feedback.
[306] A user or operator can review the identified candidate insertion zone and / or insertion zone descriptor in any suitable form (for example, by reviewing the object insertion suggestion frame and / or the insertion zone suggestion frame, etc.) and assess their suitability for the image content of the plurality of frames. In this way, an operator or user specialized in the subject can use their object insertion expertise to assess the suitability of the candidate insertion zone and / or insertion zone descriptor that were determined, at least in part, using machine learning.
[307] The feedback can take any suitable form; for example, the user can indicate whether the identified candidate insertion zone and / or insertion zone descriptor is suitable, or not, for the image content of the plurality of frames, or can rate its suitability, for example on a scale of 0 to 5, 0 to 10, or 0 to 100, etc. The feedback can then be used to improve the machine learning algorithms used in the candidate insertion zone module 110, so that the quality or suitability of candidate insertion zones and / or insertion zone descriptors determined in the future can be improved.
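By way of illustration, one possible way of capturing such feedback and turning it into a training signal is sketched below; the record fields, the rating scale and the weighting scheme are assumptions made for the example rather than details prescribed by the disclosure.

```python
from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class OperatorFeedback:
    """Illustrative record of one piece of operator feedback (field names assumed)."""
    scene_id: str
    insertion_zone_mask_path: str   # where the proposed zone mask was stored
    candidate_object_type: str
    rating: int                     # e.g. 0 (unsuitable) .. 5 (highly suitable)

def to_training_weight(feedback: OperatorFeedback, max_rating: int = 5) -> float:
    """Map a suitability rating onto a sample weight for fine-tuning.

    One possible scheme (an assumption, not prescribed by the disclosure):
    highly rated proposals reinforce the current behaviour, poorly rated ones
    are down-weighted or treated as negatives by the fine-tuning code.
    """
    return feedback.rating / max_rating

def collect_feedback_batch(records: List[OperatorFeedback]) -> List[Tuple[str, float]]:
    """Turn reviewed proposals into (scene_id, weight) pairs for the next
    fine-tuning pass of the candidate insertion zone module."""
    return [(r.scene_id, to_training_weight(r)) for r in records]
```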
[308] Subject matter experts will readily note that various changes or modifications can be made to the aspects of the disclosure described above, without departing from the scope of the disclosure.
[309] For example, optionally, system 100 may additionally comprise a final insertion module configured to receive an additional object or material for insertion into the source video scene, and to generate output material comprising at least part of the source video and the received additional object or material inserted in the candidate insertion zone. The received additional object or material can be of the type indicated by the candidate object type. The additional object or material can be received, for example, from an additional material data store / library (which can be part of the system 100, or separate from it) by retrieval based on the insertion zone descriptor, or by any other means. In this way, the final insertion module can function similarly to the object insertion module 140, as described above, but instead of creating an object insertion suggestion frame, it can actually insert the object into the image content of the plurality of frames of the scene. The insertion itself can be carried out according to any standard techniques that would be well understood by experts in the subject. The receipt and insertion of the object or material can be automatic, or it can take place after receiving approval from a user who has judged that the recommended candidate insertion zone and object type are suitable for insertion in the candidate insertion zone. In this way, a suitable additional object or material can be inserted into the image content of a scene quickly and reliably.
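A minimal sketch of such a final insertion module follows; the library lookup interface and the naive per-pixel compositing are assumptions made for the example, standing in for the standard insertion techniques mentioned above.

```python
import numpy as np

def finalise_insertion(frames, insertion_zone_mask, zone_descriptor, material_library):
    """Fetch additional material whose type matches the candidate object type and
    composite it into each frame of the scene.

    `material_library.find(...)` and the `rendered_pixels` attribute are assumed
    interfaces, and the per-pixel paste below is deliberately naive; a real
    insertion would use the standard tracking / blending techniques referred to
    in the description.
    """
    material = material_library.find(zone_descriptor["candidate_object_type"])

    output_frames = []
    for frame in frames:                       # frame: (H, W, 3) array
        composited = np.array(frame, copy=True)
        composited[insertion_zone_mask] = material.rendered_pixels[insertion_zone_mask]
        output_frames.append(composited)
    return output_frames
```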
[310] When insertion is automatic, system 100 can be configured in such a way that the only output is the output material comprising the additional object or material inserted in the candidate insertion zone. Where insertion takes place after user approval, system 100 can produce at least one of: an identification of the candidate insertion zone and candidate object types; the object insertion suggestion table; and / or the insertion zone suggestion table. Upon receipt of user approval, system 100 can then produce the output material comprising the additional object or material inserted in the candidate insertion zone.
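The two output modes can be sketched as follows; the module and function names are assumptions made for the example.

```python
def produce_output(system, scene_frames, automatic=True):
    """Sketch of the two output modes: in automatic mode the only output is the
    composited material; otherwise the recommendation is surfaced first and the
    composited material is produced only after user approval.
    """
    proposal = system.candidate_insertion_zone_module.process(scene_frames)

    if automatic:
        return system.final_insertion_module.insert(scene_frames, proposal)

    # Surface the recommendation (e.g. the object insertion suggestion table
    # and / or the insertion zone suggestion table) and wait for approval.
    system.present_suggestions(proposal)
    if system.await_user_approval(proposal):
        return system.final_insertion_module.insert(scene_frames, proposal)
    return None
```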
[311] Furthermore, Figures 1, 6, 9 and 10 comprise several interconnected modules / entities. However, the functionality of any two or more of the modules / entities can be realized by a single module; for example, the functionality of the candidate insertion zone module 110 and the object insertion module 140 can be implemented by a single entity or module. Likewise, any one or more of the modules / entities represented in the Figures can be implemented by two or more interconnected modules or entities. For example, the functionality of the scene descriptor sub-module 610 can be implemented as a system of interconnected entities that are configured to perform, together, the functionality of the scene descriptor sub-module 610. The entities / modules represented in the Figures (and / or any two or more modules that can together perform the functionality of an entity / module in the Figures) can be co-located in the same geographic location (for example, within the same hardware device), or may be located in different geographic locations (for example, in different countries). They can be implemented as only part of a larger entity (for example, a software module within a server or multipurpose computer) or as a dedicated entity.
[312] The aspects of the disclosure described above can be implemented by software, hardware or a combination of software and hardware. For example, the functionality of the candidate insertion zone module 110 may be implemented by software comprising computer-readable code which, when executed on the processor of any electronic device, performs the functionality described above. The software can be stored on any suitable computer-readable medium, for example a non-transient computer-readable medium, such as read-only memory, random access memory, CD-ROMs, DVDs, Blu-ray discs, magnetic tape, disc drives, solid state drives, and optical drives. The computer-readable medium can be distributed across computer systems coupled in a network, so that the computer-readable instructions are stored and executed in a distributed manner. Alternatively, the functionality of the candidate insertion zone module 110 can be implemented by an electronic device that is configured to perform this functionality, for example by virtue of programmable logic, such as an FPGA.
[313] Figure 13 shows an exemplary representation of an electronic device 1300 comprising a computer readable medium 1310, for example a memory, comprising a computer program configured to perform the processes described above. The electronic device 1300 also comprises a processor 1320 for executing the computer-readable code of the computer program. It will be appreciated that the electronic device 1300 may optionally comprise any other suitable components / modules / units, such as one or more I / O terminals, one or more display devices, one or more computer-readable media, one or more additional processors, etc.
Claims
1. System characterized by the fact that it comprises:
a candidate insertion zone module configured to:
receive a plurality of frames of a scene of a source video; and
process, at least in part using machine learning, image content of the plurality of frames to:
identify a candidate insertion zone for the insertion of an object in the image content of at least some of the plurality of frames; and
determine an insertion zone descriptor for the identified candidate insertion zone, the insertion zone descriptor comprising a candidate object type indicative of an object type that is suitable for insertion into the candidate insertion zone.
2. System, according to claim 1, characterized by the fact that the candidate insertion zone module comprises:
an identification sub-module configured to carry out the identification of the candidate insertion zone and the determination of the insertion zone descriptor for the identified candidate insertion zone, and to:
determine, for at least some of the pixels of the plurality of frames of the scene, an insertion probability vector that comprises a probability value for each of a plurality of insertion marks, where each probability value is indicative of the chance that the type of insertion indicated by the corresponding insertion mark is applicable to the pixel.
3. System, according to claim 2, characterized by the fact that the plurality of insertion marks comprises:
a mark indicating that the pixel is not suitable for the insertion of an object; and
one or more marks indicative of one or more corresponding types of object.
4. System, according to claim 2 or 3, characterized by the fact that the candidate insertion zone comprises a plurality of pixels whose insertion probability vectors all have an argument of maximum probability value that corresponds to a mark indicative of the candidate object type.
5. System, according to any one of claims 1 to 4, characterized by the fact that the candidate insertion zone module comprises:
a scene descriptor sub-module configured to process, using machine learning, image content of at least some of the plurality of frames to determine a scene descriptor, where the determination of the candidate object type is based, at least in part, on the scene descriptor.
6. System, according to claim 5, characterized by the fact that:
the identification of the candidate insertion zone is based, at least in part, on the scene descriptor.
7. System, according to claim 5 or 6, characterized by the fact that the scene descriptor comprises at least one regional context descriptor indicative of an entity identified in the scene.
8. System, according to claim 7, characterized by the fact that the scene descriptor sub-module is configured to process, using machine learning, image content of the plurality of frames to determine, for at least some of the pixels of the plurality of frames of the scene, a regional context probability vector that comprises a probability value for each of a plurality of regional context marks, where each probability value is indicative of the chance that the type of entity indicated by the corresponding regional context mark is applicable to the pixel.
9. System, according to claim 8, characterized by the fact that the plurality of regional context marks comprises:
a mark indicating that the pixel is not related to anything; and
at least one of:
one or more marks indicative of a human;
one or more marks indicative of an animal;
one or more marks indicative of an object;
one or more marks indicative of a surface.
10. System, according to the claim, characterized by the fact that the candidate insertion zone module additionally comprises:
an insertion zone and insertion object identification sub-module configured to identify the candidate insertion zone and the candidate object types by processing, using machine learning, image content of the plurality of frames to determine, for at least some of the pixels of the plurality of frames of the scene, an insertion probability vector comprising a probability value for each of a plurality of insertion marks, where each probability value is indicative of the chance that the type of insertion indicated by the corresponding insertion mark is applicable to the pixel.
11. System, according to claim 10, characterized by the fact that the plurality of insertion marks comprises:
a mark indicating that the pixel is not suitable for the insertion of an object; and
one or more marks indicating that one or more corresponding types of object are suitable for insertion at the pixel.
12. System, according to claim 10 or 11, characterized by the fact that the candidate insertion zone comprises a plurality of pixels whose insertion probability vectors all have an argument of maximum probability value that corresponds to a mark indicative of the candidate object type.
13. System, according to any one of claims 1 to 12, characterized by the fact that the candidate insertion zone module is additionally configured to:
receive feedback from an operator, where the feedback is indicative of the suitability of the identified candidate insertion zone and / or of the candidate object type for the image content of the plurality of frames; and
modify the machine learning based, at least in part, on the feedback.
14. Method for processing the image content of a plurality of frames of a scene of a source video, the method being characterized by the fact that it comprises:
receiving the plurality of frames of the source video scene; and
processing, at least in part using machine learning, image content of the plurality of frames to:
identify a candidate insertion zone for the insertion of an object in the image content of at least some of the plurality of frames; and
determine an insertion zone descriptor for the identified candidate insertion zone, the insertion zone descriptor comprising a candidate object type indicative of an object type that is suitable for insertion into the candidate insertion zone.
15. Computer program characterized by the fact that it executes the method as defined in claim 14 when run on the processor of an electronic device.
16. Method for training a candidate insertion zone module to identify candidate insertion zones and one or more candidate objects for insertion into a scene of a source video, the method being characterized by the fact that it comprises:
receiving a training corpus comprising a plurality of images, each annotated with the identification of at least one insertion zone and one or more candidate object types for each insertion zone; and
training the candidate insertion zone module, using machine learning and the training corpus, to process image content of a plurality of frames of the source video to:
identify a candidate insertion zone for the insertion of an object in the image content of at least some of the plurality of frames; and
determine an insertion zone descriptor for the identified candidate insertion zone, the insertion zone descriptor comprising one or more candidate object types indicative of one or more object types that are suitable for insertion into the candidate insertion zone.
Similar documents:
Publication number | Publication date | Patent title
BR102018067373A2|2019-03-19|LEARNING MACHINE FOR IDENTIFYING TYPES OF CANDIDATE OBJECTS FOR VIDEO INSERTION
Kovashka et al.2016|Crowdsourcing in computer vision
Liu et al.2010|A hierarchical visual model for video object summarization
Panda et al.2017|Weakly supervised summarization of web videos
Kao et al.2015|Visual aesthetic quality assessment with a regression model
US9846845B2|2017-12-19|Hierarchical model for human activity recognition
Cao et al.2014|Look over here: Attention-directing composition of manga elements
Shih et al.2016|MSTN: Multistage spatial-temporal network for driver drowsiness detection
US20130117780A1|2013-05-09|Video synthesis using video volumes
CN108491766B|2021-10-26|End-to-end crowd counting method based on depth decision forest
CN103988232B|2016-10-12|Motion manifold is used to improve images match
Liu et al.2016|Improving visual saliency computing with emotion intensity
Amengual et al.2015|Review of methods to predict social image interestingness and memorability
Liu et al.2021|A 3 GAN: An Attribute-Aware Attentive Generative Adversarial Network for Face Aging
Gunawardena et al.2021|Real-time automated video highlight generation with dual-stream hierarchical growing self-organizing maps
Lu et al.2019|Aesthetic guided deep regression network for image cropping
Chen et al.2016|Predicting perceived emotions in animated GIFs with 3D convolutional neural networks
Lu et al.2019|An end-to-end neural network for image cropping by learning composition from aesthetic photos
Shao et al.2017|Scanpath prediction based on high-level features and memory bias
Lu et al.2020|Learning the relation between interested objects and aesthetic region for image cropping
Liu et al.2013|Semantic motion concept retrieval in non-static background utilizing spatial-temporal visual information
CN110737783A|2020-01-31|method, device and computing equipment for recommending multimedia content
Rupprecht et al.2018|Learning without prejudice: Avoiding bias in webly-supervised action recognition
Zhan et al.2017|Cross-domain shoe retrieval using a three-level deep feature representation
Bhowmik et al.2021|Evolution of automatic visual description techniques-a methodological survey
Family patents:
Publication number | Publication date
CN109614842A|2019-04-12|
US10671853B2|2020-06-02|
US20190065856A1|2019-02-28|
GB201714000D0|2017-10-18|
EP3451683A1|2019-03-06|
Cited documents:
Publication number | Filing date | Publication date | Applicant | Patent title

WO2001031497A1|1999-10-22|2001-05-03|Activesky, Inc.|An object oriented video system|
US8930561B2|2003-09-15|2015-01-06|Sony Computer Entertainment America Llc|Addition of supplemental multimedia content and interactive capability at the client|
US8413182B2|2006-08-04|2013-04-02|Aol Inc.|Mechanism for rendering advertising objects into featured content|
US8479229B2|2008-02-29|2013-07-02|At&T Intellectual Property I, L.P.|System and method for presenting advertising data during trick play command execution|
GB0809631D0|2008-05-28|2008-07-02|Mirriad Ltd|Zonesense|
US9961388B2|2008-11-26|2018-05-01|David Harrison|Exposure of public internet protocol addresses in an advertising exchange server to improve relevancy of advertisements|
US9467750B2|2013-05-31|2016-10-11|Adobe Systems Incorporated|Placing unobtrusive overlays in video content|
US20160212455A1|2013-09-25|2016-07-21|Intel Corporation|Dynamic product placement in media content|
WO2017066874A1|2015-10-19|2017-04-27|Fatehali Dharssi|Methods and systems for processing digital video files for image insertion involving computerized detection of similar backgrounds|
US11048779B2|2015-08-17|2021-06-29|Adobe Inc.|Content creation, fingerprints, and watermarks|
US10878021B2|2015-08-17|2020-12-29|Adobe Inc.|Content search and geographical considerations|
WO2018033137A1|2016-08-19|2018-02-22|北京市商汤科技开发有限公司|Method, apparatus, and electronic device for displaying service object in video image|
US10366302B2|2016-10-10|2019-07-30|Gyrfalcon Technology Inc.|Hierarchical category classification scheme using multiple sets of fully-connected networks with a CNN based integrated circuit as feature extractor|
US11144798B2|2016-11-21|2021-10-12|V7 Ltd.|Contextually aware system and method|
US10572775B2|2017-12-05|2020-02-25|X Development Llc|Learning and applying empirical knowledge of environments by robots|
TWI709188B|2018-09-27|2020-11-01|財團法人工業技術研究院|Fusion-based classifier, classification method, and classification system|
US10853983B2|2019-04-22|2020-12-01|Adobe Inc.|Suggestions to enrich digital artwork|
CN110059642B|2019-04-23|2020-07-31|北京海益同展信息科技有限公司|Face image screening method and device|
CN113934886A|2020-06-29|2022-01-14|北京字节跳动网络技术有限公司|Transition type determination method and device, electronic equipment and storage medium|
WO2022018628A1|2020-07-20|2022-01-27|Sky Italia S.R.L.|Smart overlay : dynamic positioning of the graphics|
WO2022018629A1|2020-07-20|2022-01-27|Sky Italia S.R.L.|Smart overlay : positioning of the graphics with respect to reference points|
CN112507978B|2021-01-29|2021-05-28|长沙海信智能系统研究院有限公司|Person attribute identification method, device, equipment and medium|
Legal status:
2019-03-19| B03A| Publication of a patent application or of a certificate of addition of invention [chapter 3.1 patent gazette]|
2021-09-28| B08F| Application dismissed because of non-payment of annual fees [chapter 8.6 patent gazette]|Free format text: REGARDING THE 3RD ANNUITY. |
2022-01-18| B08K| Patent lapsed as no evidence of payment of the annual fee has been furnished to INPI [chapter 8.11 patent gazette]|Free format text: IN VIEW OF THE SHELVING PUBLISHED IN RPI 2647 OF 28-09-2021, AND CONSIDERING THE ABSENCE OF ANY RESPONSE WITHIN THE LEGAL DEADLINES, THE SHELVING OF THE PATENT APPLICATION IS TO BE MAINTAINED, AS PROVIDED IN ARTICLE 12 OF RESOLUTION 113/2013. |
Priority:
Application number | Filing date | Patent title
GBGB1714000.5|2017-08-31|
GBGB1714000.5A|GB201714000D0|2017-08-31|2017-08-31|Machine learning for identification of candidate video insertion object types|